Preface



The papers collected in this volume were presented at the International
Conference on Rough Sets and Current Trends in Computing (RSCTC'98), held in
Warsaw, June 22-26, 1998. The conference was intended as a forum for
exchanging ideas among experts in various areas of soft computing and
researchers in rough set theory and its applications, a branch of soft
computing that has grown significantly in recent years, in order to stimulate
mutual understanding and cooperation.

It is our pleasure to dedicate this volume to Professor Zdzisław Pawlak, who
created the theory of rough sets. The present state of this theory and its
applications owes very much to Professor Pawlak's enthusiasm and support.

The conference was held under the patronage of the Committee for Computer
Science of the Polish Academy of Sciences as one of the events celebrating the
50th Anniversary of starting Computer Science in Poland, and for this we wish
to express our thanks to Professor Stefan Węgrzyn, chairman of the committee.



We wish to express our gratitude to Professors Edward A. Feigenbaum, Zdzisław
Pawlak, Carl A. Petri, and Lotfi A. Zadeh who accepted our invitation to act as
honorary chairs of the conference.

The conference was held in the Barnabite Cultural Center in Warsaw and it
was sponsored by grants from the Polish State Committee for Scientific
Research (KBN) and by the Institute of Mathematics at Warsaw University, the
Department of Electronics and Information Techniques at Warsaw University
of Technology, the Polish-Japanese Institute of Computer Techniques and the
Japan International Cooperation Agency (JICA).

The following major areas were selected for RSCTC'98:

— Rough Set Theory and Applications 

— Set Theory and Applications

— Knowledge Discovery and Data Mining 

— Decision Support Systems 

— Evolutionary Computations 

— Neural Networks 

— Computing with Words and Granular Computations 

— Grammar Systems and Molecular Computations 

— Petri Nets and Concurrency 

— Complexity Aspects of Soft Computing 

— Pattern Recognition and Image Processing 

— Statistical Inference 

— Logical Aspects of Soft Computing 

— Applications of Soft Computing 




We wish to express our thanks to members of the advisory board: M. Drzewiecki,
J. W. Grzymala-Busse, T.Y. Lin, K. Malinowski, T. Munakata, A. Nakamura,
S. Ohsuga, F. Petry, Z. W. Ras, G. Rozenberg, S. Shimada, R. Słowiński, H.
Tanaka, B. Trakhtenbrot, S. Tsumoto, P.P. Wang, S. Węgrzyn, W. Ziarko for
their contribution to the scientific program of the conference.

The accepted papers, selected from 90 draft papers, were divided into regular
communications (allotted 8 pages in this volume) and short communications
(allotted 4 pages in this volume) on the basis of reviewer grades. The process
of reviewing rested with members of the program committee: R. Agrawal,
Shun-ichi Amari, Th. Baeck, M.C.F. Fernandez-Baizan, J. Bielecki,
H.-D. Burkhard, G. Cattaneo, M. Chakraborty, N. Cercone, J.-C. Cubero,
A. Czyzewski, J. Doroszewski, D. Dubois, I. Duentsch, M. Grabisch, J. Kacprzyk,
T. Kaczorek, W. Kloesgen, J. Komorowski, J. Koronacki, W. Kowalczyk,
M. Kryszkiewicz, P. Lingras, T. Luba, S. Marcus, V. W. Marek, Z. Michalewicz,
R. Michalski, M. Moshkov, Son H. Nguyen, M. Novotny, E. Orlowska, K. Oshima,
P. Pagliani, S. K. Pal, Gh. Paun, W. Pedrycz, J. F. Peters, A. Pettorossi,
Z. Piasta, F. G. Pin, L. Polkowski, H. Prade, V. Raghavan, B. Reusch,
H. Rybiński, R. Schaefer, W. Skarbek, A. Skowron (chair), K. Słowiński,
J. Stefanowski, J. Stepaniuk, R. Swiniarski, A. Szalas, R. Tadeusiewicz,
H. Thiele, W. Traczyk, D. Vakarelov, H.-M. Voigt, A. Wasilewska, A. Wierzbicki,
S.K.M. Wong, S. Yamane, Y. Y. Yao, J. Zabrodzki, Ning Zhong, J. Zytkow.

We would like to acknowledge help in reviewing from J. Dassow, C. Jędrzejek,
O. Pons Capote, L. Rudak, D. Ślęzak, Z. Suraj, M. Szczuka, S. Wierzchoń, P.
Wojdyllo, J. Wroblewski.

Invited lectures were presented at the conference by Professors Rakesh Agrawal, 
Edward A. Feigenbaum, Wolfgang Haerdle, Willi Kloesgen, Solomon Marcus, Ab 
Deviation and Association Patterns for
Subgroup Mining in Temporal, Spatial, and 
Textual Data Bases 

Willi Klösgen

German National Research Center 
for Information Technology - GMD 
D-53757 St. Augustin, Germany 
email: kloesgen@gmd.de



Abstract. Data mining is usually introduced as search for
interesting patterns in data. It is often an explorative step it- 
eratively performed within a process of knowledge discovery 
in data bases (KDD). A mining step typically relies on strate- 
gies for systematic search in large hypotheses spaces guided 
by the autonomous evaluation of statistical tests. We describe 
the subgroup mining approach that is based on deviation and 
association patterns. A typical database contains values of at- 
tributes for many objects (persons, transactions, documents). 
Interpretable subgroups of these objects are searched that de- 
viate from a designated expected behavior. Many types of data 
analysis questions can be answered by subgroup mining with 
diverse specializations of general deviation and association pat- 
terns. Tests measure the statistical interestingness of subgroup 
deviations. After summarizing the approach by discussing the 
fundamental components of subgroup pattern classes concern- 
ing validation, search and interactive presentation of pattern 
instances, we explain how deviation patterns of subgroup min- 
ing are applied for temporal, spatial and textual databases. 



1 Introduction: The pattern paradigm 

The paradigm of large scale and systematic search for patterns serves as a funda- 
mental idea for introducing data mining. So most informal definitions describe 
data mining as that step in a KDD process in which valid, novel, potentially 
useful, and understandable patterns are searched in data (Fayyad et al., 1996).
A pattern is seen as a statement S in a high level language that describes a
subset D(S) of a data base D with a quality q(S) (Klösgen and Zytkow, 1996).
Since a tremendous number of subsets exists in a data base for which a huge 
collection of quite different types of statements can be checked, a problem ori- 
ented focusing is necessary when setting a data mining step. Typically the user 
of a data mining system, when directing a mining step, specifies the type of the 
statements and several constraints on the data subsets. Thus the search space 
of a mining step contains similar statements of a selected type on each of the
subsets implicitly selected by the specification of constraints. 

Similar statements can be considered as instances of a special pattern class. A 
pattern class encompasses the generic properties of its instances. Therefore, the 
definition of the generic properties is constitutive for a pattern class providing the 
methods by which all the instances of the class are constructed and processed. 
Three main properties are treated in this paper that determine how pattern 
instances are validated, searched and presented. 

A pattern class is defined by its meaning, i.e. the statistical content or an- 
alytic question, given for instance by a model or hypothesis type. An abstract 
analytic question is operationalized by the validation method including a verifi- 
cation and a quality computation part. A pattern instance is statistically verified 
in an associated subset of data, e.g. by testing an hypothesis, and its quality is 
computed by optionally considering also non-statistical aspects of interesting- 
ness. 

Other generic properties refer to the presentation schemes, the search di- 
mensions of the instance space, and the search control and pruning possibilities. 
Presentation schemes especially must determine an appropriate visualization of 
a statement showing its significance and quality figures and their components. 
Moreover, presentation of a set of interdependent statements and the exploratory, 
interactive operations that a user can perform on these presentations have to be 
designed. Based on a description of the search dimensions and pruning options 
of a pattern class, a general search algorithm can efficiently organize and process 
the search space. For a given analytic question, several options can be selected for 
verification and quality computation, presentation in natural language or graph- 
ical form, and search control. For instance, diverse statistical tests can check, if 
the probability of a binary event deviates in a subgroup. 

Based on the generic properties defining a pattern class, a pattern instance 
is treated in a mining system as a node in a search space which is processed by 
a search algorithm, a hypothesis or a model on a subset of the data base tested 
and evaluated by a verification and a quality method, and a statement which is 
presented to the user. 

Since the pattern paradigm refers to a very broad definition, many mining 
tasks can be captured within this framework. Some simple patterns illustrating 
the definition are included in Table 1, and some model-type patterns are shown
in Table 2. An important KDD pattern class, the subgroup paradigm, is then 
treated in the rest of this paper, where we will discuss in detail the constitutive 
aspects of subgroup mining given by the generic properties of this pattern class. 
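
The two simple patterns of Table 1 can be made concrete with a small sketch. It is only an illustration under stated assumptions: records are plain dictionaries, binary attributes take the values 0 and 1, and the function names (`covers`, `is_key`, `covering_sets`) are hypothetical and not taken from any of the systems discussed in this paper. The levelwise search uses the search control property of Table 1: a set can only be covering if each of its subsets is covering.

```python
from itertools import combinations


def covers(records, attrs, c):
    """True if at least c records have value 1 for every attribute in attrs."""
    return sum(all(r[a] == 1 for a in attrs) for r in records) >= c


def is_key(records, attrs):
    """True if no two records share the same values on all attributes in attrs."""
    seen = set()
    for r in records:
        values = tuple(r[a] for a in attrs)
        if values in seen:
            return False
        seen.add(values)
    return True


def covering_sets(records, attributes, c):
    """Levelwise search for all covering sets of attributes.

    A candidate of size k+1 is only validated if all of its subsets of
    size k were found to be covering (pruning by the subset property).
    """
    level = [frozenset([a]) for a in attributes if covers(records, [a], c)]
    result = list(level)
    while level:
        found = set(level)
        candidates = {s | t for s in level for t in level
                      if len(s | t) == len(s) + 1}
        level = [s for s in candidates
                 if all(frozenset(sub) in found
                        for sub in combinations(s, len(s) - 1))
                 and covers(records, s, c)]
        result.extend(level)
    return result
```

For key attributes the pruning direction is reversed (each superset of a key is a key), so a search would stop expanding a set as soon as `is_key` succeeds.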

2 Interesting subgroups: Two general pattern classes 

The language for building pattern instances has not been elaborated in the 
introductory pattern definition. Hence many specializations of this definition 
are possible, as the examples in Table 1 and Table 2 show. KDD usually deals 
with given data bases that contain data on objects such as clients, transactions,
|                | Covering sets of attributes | Key attributes in a database |
|----------------|-----------------------------|------------------------------|
| Meaning        | At least c records of database assume the (positive) values for all (binary) attributes of the set. | There are no different records in data base that assume the same values for all key attributes. |
| Validation     | simple arithmetic calculation | simple arithmetic calculation |
| Presentation   | symptom1, symptom9, symptom27 appear together in 23% of cases | {att1, att5, att13} is a set of key attributes |
| Instance space | all subsets of attributes | all subsets of attributes |
| Search control | set covering ⇒ each subset covering | set is key ⇒ each superset is key |

Table 1. Generic properties for two simple patterns





|                | Regression model | Conceptual cluster |
|----------------|------------------|--------------------|
| Meaning        | Describes a linear relation between a dependent and a set of independent variables. | A set of similar cases described by the predictive and predictable values of normative attributes. |
| Validation     | verification: F-test; quality: adjusted R square | category utility measure, e.g. Cobweb, see (Fisher, 1987) |
| Presentation   | log Salary = 3.85 + 0.001 Age + 0.103 Sex + 0.3 Education Level | Toxic Waste - yes (0.81, 0.90); Budget Cuts - yes (0.81, 0.81); SDI Reduction - no (0.93, 0.88) |
| Instance space | all subsets of independent attributes | all subsets of cases |
| Search control | forward selection, backward elimination, stepwise regression | hill climbing |

Table 2. Generic properties for two model-type patterns



documents. Therefore, one obvious pattern language describes subgroups of these 
objects. A subgroup is an interpretable and analysis relevant subset of the object 
population. This specialization of the general pattern approach can informally be 
characterized as searching for interesting subgroups. A first generation subgroup 
mining system is the Explora system (Klösgen, 1992; 1996). The subgroup mining
task has more formally been introduced by Siebes (1995) as a basis for the 
development of the Data Surveyor mining system. 

The motivation for the subgroup approach is given by a frequent data analytic 
goal where an analyst is interested in a special property or behavior of one or 
several selected target variables. Those regions of the input variable space are 
searched where the target variables show this behavior. The analyst could e.g. 
be interested in regions with a high average value of a selected continuous target 
variable, or with a high share of one value of a binary target, or in regions 
for which this share has significantly more increased between two years than 
in the complementary region, or in regions that show a special time trend of a 
target variable. Thus many different data analytic questions can be represented
as special behavior types of target variables. 

The usual approach applied for many of these data analyses relies on an
approximation of the unknown function that describes the dependency between the
target variable(s) and the input (independent) variables. The function derived 
as an approximation is then studied to identify the interesting regions. However, 
a good approximation is often difficult to find, especially when there are many 
input variables, a global approximation (in the whole input space) has to be 
derived, and a probabilistically based approximation must be found with a sam- 
ple of noisy data. So often the direct approach of searching for the interesting 
subgroups without relying on an intermediary functional approximation is more 
efficient (Friedman and Fisher, 1997). 

To formalize the subgroup approach, a specification of the description
language to build subgroups (Section 3) and an operationalization of behavior
patterns are needed. The behavior of target variables y = (y1, ..., yk) is
captured by assuming a probabilistic approach and referring to their joint
distribution with the input variables x = (x1, ..., xm). One is interested in
some designated property of the unknown joint distribution, respectively of the
density function p(y,x). The interesting behavior is now defined by a
statistical test. In the null
hypothesis of such a test, the assumption (the property or behavior) for the 
distribution of the target variables in the subgroup is specified that is regarded 
as expected or uninteresting. The alternative hypothesis defines the deviating 
or interesting subgroup referring to the property of interest for the distribution. 
When a given data set spots the null hypothesis for a subgroup as very unlikely 
(under a given confidence threshold), the subgroup (i.e. the behavior of the tar- 
get variable(s) in the subgroup given by the distribution property) is identified 
as interesting. 
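
As a minimal sketch of this test approach, consider a binary target variable. The null hypothesis states that the share of the target value in the subgroup equals an expected share p0 (for example, the share in the whole population), and the alternative states that it is higher. The exact one-sided binomial p-value below is one possible choice of test; the function names and the threshold `alpha` are illustrative assumptions, not part of the systems described here.

```python
from math import comb


def binomial_p_value(n, k, p0):
    """One-sided p-value P(X >= k) for X ~ Binomial(n, p0).

    n  : size of the subgroup
    k  : number of subgroup objects showing the target value
    p0 : share of the target value expected under the null hypothesis
    """
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i)
               for i in range(k, n + 1))


def is_interesting(n, k, p0, alpha=0.01):
    """The subgroup is flagged when the null hypothesis is very unlikely."""
    return binomial_p_value(n, k, p0) < alpha
```

For a subgroup of 10 objects that all show the target value with p0 = 0.5, the p-value is 1/1024, so the subgroup would be flagged at alpha = 0.01; with only 6 of 10 it would not.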

The test approach has three advantages for subgroup mining. It allows a 
broad spectrum of data analytic questions to be treated, offers intelligent solu- 
tions to balance the trade-off between diverse criteria for assessing the statistical 
interestingness of a deviation, e.g. size of subgroup versus amount of deviation, 
and finally mitigates the problem of just discovering random fluctuations of the 
target variables in a given noisy sample data base as interesting. 

Besides searching for interesting subgroups, a second general mining task 
consists of searching for interesting pairs of subgroups. Interestingness is then 
evaluated by association measures for the two subgroups of a pair. Association 
rules (Agrawal et al., 1996) or sequences (Mannila et al., 1997) are typical
examples of this second general subgroup mining pattern. Mostly these association
measures rely on an evaluation of the 2x2 cross table constructed for the two 
subgroups. In case of time stamped objects, time relations between objects (e.g. 
successor) can underlie the cross table calculation and special association mea- 
sures (e.g. for sequences). 
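
The evaluation of a pair of subgroups by its 2x2 cross table can be sketched as follows. Support and confidence are the measures commonly used for association rules; lift is added here as a further standard measure. The helper names and the representation of subgroups as membership predicates are assumptions made for this illustration, and both subgroups are assumed to be non-empty.

```python
def cross_table(objects, in_a, in_b):
    """Return the 2x2 counts (n11, n10, n01, n00) for subgroups A and B.

    in_a and in_b are predicates deciding membership of an object.
    n11: in A and B, n10: in A only, n01: in B only, n00: in neither.
    """
    n11 = n10 = n01 = n00 = 0
    for obj in objects:
        a, b = in_a(obj), in_b(obj)
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def association_measures(n11, n10, n01, n00):
    """Support, confidence and lift of the rule A -> B (A and B non-empty)."""
    n = n11 + n10 + n01 + n00
    support = n11 / n
    confidence = n11 / (n11 + n10)
    lift = confidence / ((n11 + n01) / n)
    return {"support": support, "confidence": confidence, "lift": lift}
```

For time stamped objects, the same table could be filled by counting pairs of objects related by a time relation such as successor instead of single objects.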

The generic components of subgroup patterns include a description language 
to construct subgroups, a verification method to test the statistical significance 
of a subgroup, quality functions to measure the interestingness of a single
subgroup and of a set of subgroups, constraints limiting the space of admissible
subgroups, and search goals and controls defining additional properties of the 
subgroups to be found. Interactive visualization of individual subgroups and sets 
of interdependent subgroups has to be fixed in the presentation component of 
a pattern. The mining task consists in applying a search strategy to discover 
the interesting subgroups which is appropriate for the given search goals and 
constraints and to present the results to the user so that she can operate on the 
presentation. In a very similar way, the generic properties can be introduced for 
pattern classes that deal with searching for interesting pairs of subgroups. 

3 Description languages to build subgroups 

Description languages for subgroup mining are mostly conjunctive, propositional 
languages. So we assume that the data base consists of one or several
relations, each with a schema {A1, A2, ..., An} and associated domains Di for
the attributes Ai. A conjunctive, propositional description of a subgroup is
then given by: (A1 ∈ V1) ∧ ... ∧ (An ∈ Vn) with Vi ⊆ Di. E.g.
age ∈ [18, 25] ∧ region ∈ {50, 50, 57V, RE}. Conjunctive selectors with
Vi = Di can of course be omitted in a description.
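
A conjunctive, propositional description and its extension can be sketched in a few lines. The representation (a dictionary mapping an attribute name to a predicate for its value set Vi) and the region values used in the usage note are assumptions for illustration only.

```python
def interval(low, high):
    """Selector for a closed interval [low, high] of an ordered domain."""
    return lambda value: low <= value <= high


def one_of(*values):
    """Selector for an explicit subset Vi of a nominal domain Di."""
    allowed = frozenset(values)
    return lambda value: value in allowed


def extension(description, records):
    """All records satisfying every selector of the conjunctive description.

    Attributes without a selector are unrestricted, which corresponds to
    omitting selectors with Vi = Di.
    """
    return [r for r in records
            if all(test(r[attr]) for attr, test in description.items())]
```

A description such as `{"age": interval(18, 25), "region": one_of("north", "south")}` then selects the young objects of the two (hypothetical) regions, and the empty description `{}` selects the whole population.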

Single relation languages allow only a single relation to be analyzed, possibly
constructed in a preprocessing phase by a join of several relations. Then only
the attributes belonging to the schema of this single relation can be used for
subgroup descriptions. Multirelational languages do not require a preprocessing
join operation and allow descriptions to be built with attributes from several
relations. They are especially useful for applications that require different
target object classes with flexible joins to be analyzed. In the Kepler system,
we use the MIDOS subgroup miner (Wrobel, 1997) to discover subgroups of objects
of a selected target relation. A description may include selectors from several
relations, which are linked by foreign link attributes.

A data base, for instance, includes relations on hospitals, patients, diagnoses 
of patients, and therapies for patient-diagnoses with obvious foreign links such 
as patient-id linking patient-diagnoses and patients. The analyst chooses a tar- 
get object class, e.g. patients, and can decide on the other object classes, e.g. 
hospitals and patient-diagnoses, and their attributes to be used for building sub- 
groups of patients. Such a subgroup could e.g. be described by male patients with 
a cancer diagnosis treated in small hospitals. 

By these foreign links and the implicit existential quantifier used for linking 
relations (e.g. patients with at least one diagnosis of a type), a very limited 
Inductive Logic Programming approach is applied, extending the simple one 
relational propositional approach. The full ILP approach has not (yet) been 
used for subgroup mining. 
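
The hospital example can be sketched as follows to show the implicit existential quantifier. This is not the MIDOS implementation; the attribute names (`patient_id`, `hospital_id`, `sex`, `type`, `beds`) and the definition of a small hospital by a bed threshold are assumptions made for the illustration.

```python
def male_cancer_patients_in_small_hospitals(patients, diagnoses, hospitals,
                                            max_beds=100):
    """Subgroup of the target relation 'patients' described over three relations.

    A patient belongs to the subgroup if he is male, is treated in a hospital
    with at most max_beds beds, and has AT LEAST ONE cancer diagnosis
    (existential quantifier over the linked patient-diagnoses relation).
    """
    small = {h["hospital_id"] for h in hospitals if h["beds"] <= max_beds}
    with_cancer = {d["patient_id"] for d in diagnoses if d["type"] == "cancer"}
    return [p["patient_id"] for p in patients
            if p["sex"] == "male"
            and p["hospital_id"] in small
            and p["patient_id"] in with_cancer]
```

The two sets play the role of the foreign links: membership in `with_cancer` holds as soon as one linked diagnosis tuple satisfies the selector, regardless of how many other diagnoses the patient has.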

A next aspect for specializations of description languages refers to taxonomies 
that can be used for subgroup descriptions. To restrict the number of descrip- 
tions, usually not every subset of the domain of an attribute Ai is allowed in a 
description. A taxonomy Hi consisting of a set of subsets of Di holds the allowed
subsets. A taxonomy is hierarchically arranged, i.e. the subsets are partially or-
dered by inclusion. Usually Hi will be much smaller than the power set of Di.
A taxonomy for an attribute can explicitly and statically be selected by the 
user for the description language of a mining task, or dynamically and implicitly 
determined by a special subsearch process that generates and evaluates certain 
subsets of attribute values during a mining task. 
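
A statically given taxonomy can be represented directly as a set of allowed value subsets, with the hierarchy induced by inclusion. The sketch below is an assumption about one convenient representation (a set of `frozenset` objects), not a description of how any of the cited systems store taxonomies.

```python
def is_allowed(taxonomy, subset):
    """A value subset Vi may be used in a description only if it is in Hi."""
    return frozenset(subset) in taxonomy


def direct_specializations(taxonomy, node):
    """Allowed proper subsets of node with no allowed set strictly in between.

    These are the children of node in the hierarchy given by the partial
    order of set inclusion.
    """
    below = [s for s in taxonomy if s < node]
    return [s for s in below if not any(s < t for t in below)]
```

For a domain with four values, a taxonomy with one root, two intermediate sets and four singletons holds 7 of the 16 possible subsets; a search that specializes a selector would move from a node only to its direct specializations.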

Restricting the description space by taxonomies is not only important to 
reduce the search effort for a mining task, but also to produce descriptions which 
are simpler, present the subgroups on an appropriate hierarchical level, and are 
relevant for the application domain by avoiding faked or nonsense subsets Vi.
With a taxonomy, implicit internal disjunctions are introduced. For attributes 
with many values, elementary selectors built with a single value and added as 
a further conjunct to a subgroup will often lead to a small resulting subgroup
whose statistical significance may be too low, so that results can only
be found on more general levels. Finally, taxonomies are also important for
the effectiveness of search algorithms: they can avoid too greedy algorithms and 
allow more patient search strategies (Section 5). 

User defined static taxonomies usually are global and not adjusted to an
analytic question or pattern type. To generate taxonomies dynamically for a
subgroup mining task, appropriate subsets of values must be found. Depending
on the type of an attribute Ai, diverse methods are used for automatically
generating value subsets for Ai. Nominal, ordinal, and continuous attributes
are distinguished. Sometimes background knowledge on the domain Di can be used
for a taxonomy construction (see Section 8). Most work in taxonomy construction
relates to the continuous case and the discretization problem; see (Dougherty
et al., 1995) for an overview.

In subgroup mining, we mainly need supervised methods that exploit the 
joint distribution with the target variables selected for a special mining task. 
Supervised discretization methods have been developed for the classification pat- 
tern type and usually deal with a binary target variable. Then discretizations 
shall be generated that optimize the classification accuracy of e.g. a decision 
tree. Unsupervised methods usually rely only on the univariate distribution of 
the attribute for which a taxonomy shall be derived. Quantiles and other simple 
methods, but also density based or clustering approaches which may also exploit 
all other attributes (not particularly the target attributes) are used. 

Next, we distinguish global and local methods. Global methods find a tax- 
onomy independent from a subgroup which is being expanded by a conjunctive 
selector. In the context of subgroup mining and mostly not homogeneous data 
bases, local methods are preferable. For example, a taxonomy derived in the con- 
text of expanding the local subgroup males could be different from a taxonomy 
for the subgroup females. 

Top down methods generate taxonomies by recursive specializations of al- 
ready found sets on each hierarchical level. For ordinal (and continuous) at- 
tributes, a best splitting point into two (or more) intervals is found on each level. 
However, recursive splitting does not necessarily find the best multi-interval split. 
Bottom-up methods generalize the sets found already on a hierarchical level to
create the sets on the next higher level, e.g. by merge techniques. 
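
A supervised top-down method for an ordinal or continuous attribute can be sketched as recursive binary splitting. The sketch assumes a binary target coded as 0/1 and scores a cut by the absolute standardized deviation of the target share in the lower interval from the share in the currently analyzed set (normal approximation of the binomial test). This scoring is one plausible choice made for illustration, not the criterion of a particular system, and the greedy recursion illustrates why the best multi-interval split is not necessarily found.

```python
from math import sqrt


def best_split(values, targets):
    """Return (threshold, score) of the best cut 'value <= threshold'.

    Returns (None, 0.0) if the target share is 0 or 1 or no cut exists.
    """
    pairs = sorted(zip(values, targets))
    n_all = len(pairs)
    p0 = sum(t for _, t in pairs) / n_all
    if not 0.0 < p0 < 1.0:
        return None, 0.0
    best = (None, 0.0)
    n = k = 0
    for i, (v, t) in enumerate(pairs[:-1]):
        n += 1
        k += t
        if v == pairs[i + 1][0]:
            continue  # cut only between distinct attribute values
        score = abs((k / n - p0) * sqrt(n) / sqrt(p0 * (1 - p0)))
        if score > best[1]:
            best = (v, score)
    return best


def split_recursively(values, targets, depth):
    """Sorted list of thresholds found by recursive splitting up to depth."""
    if depth == 0 or len(values) < 2:
        return []
    threshold, _ = best_split(values, targets)
    if threshold is None:
        return []
    left = [(v, t) for v, t in zip(values, targets) if v <= threshold]
    right = [(v, t) for v, t in zip(values, targets) if v > threshold]
    return (split_recursively([v for v, _ in left], [t for _, t in left], depth - 1)
            + [threshold]
            + split_recursively([v for v, _ in right], [t for _, t in right], depth - 1))
```

Calling `best_split` on the objects of the subgroup that is currently being expanded, rather than on the whole population, gives a local variant in the sense discussed above.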

For subgroup mining, we are mainly interested in supervised, local methods 
that are adjusted to the pattern type. Bottom-up methods can be problematic, 
because of the small significance of the lower levels in a local context. Time 
efficiency is another criterion. Local taxonomy derivation is of course extremely
time-consuming. Further, it is preferable to use the same evaluation framework 
for taxonomy finding as for the pattern type dependent evaluation of subgroups. 
An overly sophisticated mixture of methods might be difficult for a
user to understand. We have considered these requirements when implementing
the taxonomy finder in Data Surveyor.

4 Validation of subgroup patterns 

The general subgroup pattern (Section 2) can be specialized in various ways. 
These special analytic questions can be classified with two dimensions. At first, 
the type of the target variable is important. If this is a binary variable, a single 
percentage is analyzed for a subgroup, e.g. the percentage of good productions (or 
of complementary not good productions). In case of a nominal variable, a whole
vector is studied, e.g. the percentages of bad, medium, and good productions. 
When the target variable is ordinal, the probability of a better value in the 
subgroup than in a reference group can be analyzed. E.g. the probability that a 
subgroup of productions has a better quality than all the productions. Finally, 
the target variable can be continuous (interval or ratio type). Then statements 
on the median or mean value of the variable can be inferred. If several target 
variables are selected, their joint distribution in a subgroup is analyzed. 

The second dimension for classifying analytic questions refers to the number 
of studied populations which can e.g. relate to several time points or countries. 
When the database contains one cross section (i.e. one population of objects), 
the subgroup is usually compared for deviations with the whole population or 
some root or parent subgroup, resp. with its complementary subgroup. In case of 
two independent cross sections or of a panel of a population, the latter including 
the same objects for two (or more) time points, the change of the distribution 
of the target variables is analyzed for a subgroup. Next, k cross sections can be 
analyzed, for instance data for k time points or countries. Finally, a database 
includes a time series of populations when the segmentation attribute that gen- 
erates the k cross sections is ordinal with equidistant values. 

A pair of verification and quality functions is used to evaluate the devia- 
tion of a subgroup. The verification method operationalizes a special analytic 
question by a statistical test. Table 3 lists some tests according to the above 
classification of analytic questions that are offered in Explora, Kepler and Data
Surveyor. When relying on parametrical tests in subgroup mining, a property
of one of the distribution parameters of the target variable(s) in the subgroup 
determines the meaning (semantics) of an analytical question. For continuous 
target variables, nonparametrical tests are appropriate in the data mining
context, because the smaller test power (adhering longer to the null hypothesis) is
mostly not a problem, and the modest distribution assumptions and calculation 
efforts of these tests are preferable. Also the usually large sample and explorative 
situation in data mining favors non-parametrical tests. The verification method 
is used as a filter constraint to subselect pattern instances. Thus only deviations 
are selected that have a very low probability of being generated just by random 
fluctuations of the target variables. 
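
For a continuous target variable, a nonparametrical verification test can be sketched with the median test, here in the form of Mood's median test comparing a subgroup with its complementary subgroup. The version below is a simplified assumption: it uses the chi square statistic of the 2x2 table without continuity correction and needs only the standard library.

```python
from math import erfc, sqrt
from statistics import median


def median_test(subgroup, complement):
    """Mood's median test for two samples; returns (chi2, p_value).

    Counts how many values of each sample lie above the median of the
    pooled sample and tests the resulting 2x2 table with a chi square
    statistic with one degree of freedom.
    """
    grand = median(list(subgroup) + list(complement))
    a = sum(v > grand for v in subgroup)      # subgroup, above median
    b = len(subgroup) - a                     # subgroup, not above
    c = sum(v > grand for v in complement)    # complement, above median
    d = len(complement) - c                   # complement, not above
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For one degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    return chi2, erfc(sqrt(chi2 / 2))
```

Used as a filter constraint, a pattern instance would be kept only if the returned p-value lies below the chosen significance threshold.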



| Type of dependent variable(s) | One cross section | Two independent cross sections | k independent cross sections and time series |
|-------------------------------|-------------------|--------------------------------|----------------------------------------------|
| Binary     | binomial test; chi square test; confidence intervals; information gain | bin. test: pooled variance; chi square test; log odds ratio: z-scores (each with absolute / relative version) | chi square tests; trend test |
| Nominal    | chi square: goodness of fit, independence test; Gini diversity index; information gain; twoing criterion | chi square tests; Gini diversity index | chi square test; trend analysis |
| Ordered    | ridit analysis | ridit analysis | ridits & trend analysis |
| Continuous | median test; median-quantile test; G-test; H-test; 1 or 2 sample t-test | median test; median-quantile test; G-test; H-test; two sample t-test | analysis of variance |

Table 3. Some statistical verification tests for subtypes of the subgroup pattern



The quality function is used by the search algorithm to rank the instances. For 
instance, in a beam search strategy (Section 5), only the best n pattern instances 
according to their quality value are further expanded. The quality computation 
can relate to statistical and other interestingness aspects such as simplicity,
usefulness, novelty. It can directly be given by the significance value or test 
statistic of the verification method, or by a function exploiting this significance 
value as one component for the final quality. A typical statistical quality function 
(e.g. defined by z-scores) combines several aspects of interestingness such as 
strength (deviation of parameter from a-priori value) and generality (size of the 
subgroup). 
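
As a worked illustration of such a trade-off, one common z-score quality function for a binary target is the standardized deviation of the binomial test. The formula is given here as an assumption about a typical choice, not as the specific function of any of the systems mentioned:

```latex
q = \frac{p - p_0}{\sqrt{p_0 (1 - p_0)}} \, \sqrt{n}
```

Here p is the share of the target value in the subgroup, p_0 the a-priori share, and n the size of the subgroup. The first factor measures strength and the second generality. With p_0 = 0.2, a subgroup with n = 100 and p = 0.4 obtains q = 0.5 * 10 = 5, whereas a smaller subgroup with n = 25 and the larger share p = 0.5 obtains only q = 0.75 * 5 = 3.75, so the larger subgroup is ranked higher despite its smaller deviation.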
5 Search

Large scale and mostly brute force search is the core of data mining algorithms. 
In a sense, this approach mimics the procedure of data analysts when looking at a 
set of cross-tabulations to find interesting cells or when generating a sequence of 
statistical (e.g. regression) models. Due to the limited manual analysis capacities, 
a computerized brute force search can be organized more systematically and 
more completely to cover large parts of hypotheses spaces. 

Search can be exhaustive or heuristic. An exhaustive strategy prunes only 
hypotheses that cannot belong to the solutions, whereas heuristic strategies like
hill-climbing and its extensions (tabu search, simulated annealing), beam-, tree- 
, or stepwise search aspire to process prospective regions of the search space, 
but cannot exclude that there are (better) solutions in the pruned subspaces. 
Often the description language cannot be restricted in a way that allows ex- 
haustive search (e.g. by limiting the maximal number of conjunctions, applying 
only coarse discretizations and taxonomies). Another possibility to avoid com- 
binatorial explosion is to apply constraints on the search space. Ideally, such 
constraints represent domain knowledge, preventing also from discovering many 
uninteresting descriptions. 

Many search strategies exploit the partial order that is associated with a space
of descriptions. Descriptions can be partially ordered by the generality of the 
intensional descriptions or by the subset relation of their extensions (set of all 
objects that satisfy the description). Genetic search strategies apply genetic op- 
erators like mutation and cross-over to produce a new generation of descriptions 
which iterates in an evolutionary process to a set of high quality descriptions. 
Parallel approaches distribute search onto subsearches that can be scheduled in 
parallel. 

Search can be scheduled in several phases. In a first brute force phase, all 
subgroups are determined that satisfy the given constraints and search goals. 
In a second refinement phase, redundancies are eliminated and selected sub- 
groups are elaborated. Redundancies relate mainly to the correlation between 
subgroups which may be responsible for spurious effects. Elaborations analyze
the homogeneity of subgroups, e.g. to detect cases where not the subgroup as a
whole but only a subset of it is relevant. Brute force and refinement phases can be
scheduled iteratively. 

Search strategies iterate over two main steps: validating hypotheses and 
generating new hypotheses. Operating on a current population of hypotheses, 
neighborhood operators generate the neighbors of hypotheses, e.g. by expand- 
ing hypotheses with additional conjunctive terms, or genetic operators create a 
next generation of hypotheses by mutation and crossover operations. Both the 
validation and the generation step consist of four substeps. At first promising 
hypotheses are selected from the list of not yet validated, newly generated
offspring, respectively from the list of not yet further expanded, newly validated
hypotheses. Then the selected hypotheses are validated by applying a statistical 
verification and quality computation module, resp. expanded by applying neigh- 
borhood or genetic operators creating new hypotheses. In the third substep, the 
newly validated or generated hypotheses are jointly evaluated, e.g. to identify
solutions or check pruning possibilities. Finally, the populations of hypotheses 
are updated. 



Search step: select hypotheses for validation from the list of generated, not
yet validated hypotheses
    Beam Search: all
    Broad View:  all
    Best n:      all
    Patient:     all

Search step: validate
    All strategies: apply verification test and quality computation

Search step: evaluation of validated hypotheses
    Beam Search: sort successfully verified, not prunable hypotheses (cover
                 constraint) by quality and put the best n on the list of
                 hypotheses to be expanded
    Broad View:  put not successfully verified, not prunable hypotheses (cover
                 constraint) on the list of hypotheses to be expanded; put
                 successfully verified hypotheses on the result list
    Best n:      update the list of best n hypotheses with successfully
                 verified hypotheses; put not prunable hypotheses (cover
                 constraint, optimistic estimate) on the list of hypotheses to
                 be expanded
    Patient:     sort successfully verified, not prunable hypotheses by quality
                 and put the best one on the list of hypotheses to be expanded;
                 if there is no better hypothesis, repeat the process, but
                 eliminate all cases covered by the subgroups found

Search step: update list of hypotheses not yet validated
    All strategies: not applicable, all have been validated

Search step: select hypotheses to expand
    Beam Search: all
    Broad View:  all
    Best n:      all
    Patient:     all

Search step: expand hypotheses
    Beam Search, Broad View, Best n: dependent on type of expansion attribute:
                 discretization, regional clustering
    Patient:     eliminate one internal disjunction or quantile

Search step: evaluation of expanded hypotheses
    Broad View:  eliminate successors of results
    Beam Search, Best n, Patient: no entry

Search step: update list of hypotheses to be expanded
    All strategies: not applicable, all have been expanded


Table 4. Four simple brute force search strategies for subgroup mining 



The validation and generation steps iterate until the solutions are found or
the space of hypotheses is exhausted. Search strategies fix the details within
this general search frame, e.g. the order in which the hypotheses are evaluated, 
expanded and validated, the selection and pruning criteria, and the iteration, 
recursion or backtracking of the search. In Table 4, these steps are summarized
for search strategies implemented in Data Surveyor. These strategies take simple
decisions in most of the steps. All these strategies perform a brute force
search to identify a set of hypotheses with high quality. Whereas beam search 
at each step only expands the best hypotheses to find more specialized, better 
subgroups, the broad view strategy is complementary. If a high quality subgroup 
is found, it is not further expanded. So subgroups can be identified that consist 
of a conjunction of selectors, where each selector alone is not interesting. The
"best n" strategy is exhaustive, so efficient pruning is necessary for large
hypothesis spaces. This can be achieved by a restrictive cover constraint (requiring
a relatively large size of a subgroup). The optimistic estimate evaluation of
a subgroup checks whether any specialization (expansion) can have a higher quality
than the worst of the currently best n hypotheses.
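
The validation and generation loop can be sketched for the beam search strategy as follows. This is only a minimal illustration, not the Data Surveyor implementation: the function names, the representation of selectors as (attribute, value) pairs, the quality function sqrt(n) * (p - p0) and the default parameters are all assumptions made for the example.

```python
import math


def share_quality(members, population, target="y"):
    """Assumed quality function: sqrt(n) * (p - p0) for a binary target."""
    p = sum(r[target] for r in members) / len(members)
    p0 = sum(r[target] for r in population) / len(population)
    return math.sqrt(len(members)) * (p - p0)


def beam_search(records, selectors, quality=share_quality,
                beam_width=2, max_depth=2, min_cover=2):
    """Hypothetical sketch: validate and expand conjunctive descriptions,
    expanding only the best `beam_width` hypotheses of each level."""

    def cover(description):
        # Extension of a description: all records satisfying every selector.
        return [r for r in records
                if all(r[attr] == value for attr, value in description)]

    results = {}                                        # description -> quality
    to_validate = [frozenset([s]) for s in selectors]   # generated, not validated
    for _ in range(max_depth):
        # Validation step: select all, validate, evaluate.
        validated = []
        for hypo in to_validate:
            members = cover(hypo)
            if len(members) < min_cover:                # cover constraint: prune
                continue
            q = quality(members, records)
            results[hypo] = q
            validated.append((q, hypo))
        validated.sort(key=lambda pair: pair[0], reverse=True)
        to_expand = [hypo for _, hypo in validated[:beam_width]]

        # Generation step: add one further conjunct to each selected hypothesis.
        seen = set(results)
        to_validate = []
        for hypo in to_expand:
            used = {attr for attr, _ in hypo}
            for selector in selectors:
                child = hypo | {selector}
                if selector[0] not in used and child not in seen:
                    seen.add(child)
                    to_validate.append(child)
        if not to_validate:
            break
    return sorted(results.items(), key=lambda item: item[1], reverse=True)
```

The function returns all validated descriptions ranked by quality. Replacing the selection of `to_expand` would give the other strategies of Table 4, e.g. expanding only the hypotheses that were not successfully verified for the broad view strategy.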

Another aspect of a search strategy relates to its greediness. The usual gen- 
eral to specific search over subgroups realized by successively adding further 
conjuncts is very greedy, i.e. the size of the next subgroup is much reduced 
by a further conjunct. Especially for hill climbing strategies, this is a problem. 
Friedman and Fisher (1997) therefore propose a patient strategy based on a 
description language offering all internal disjunctions for categorical variables 
and (high) quantiles for continuous variables. At each specialization step, one 
internal disjunction is eliminated or one upper or lower quantile is taken away 
from the current interval for the variable. So only a small part of the objects of
the current subgroup is removed in a specialization step.
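
A single patient specialization step on a continuous variable might look as follows. This is a sketch in the spirit of the strategy described above, not the procedure of Friedman and Fisher as published: the function name `peel_once`, the row representation as dictionaries, the use of the target mean as quality and the peeling fraction `alpha` are assumptions.

```python
def peel_once(rows, variable, target, alpha=0.1):
    """Remove either the lowest or the highest alpha-fraction of `rows`
    along `variable`, whichever leaves the higher mean of `target`."""
    ordered = sorted(rows, key=lambda r: r[variable])
    k = max(1, int(alpha * len(ordered)))   # number of objects peeled off
    if k >= len(ordered):
        return ordered                      # nothing sensible left to peel

    def mean_target(subset):
        return sum(r[target] for r in subset) / len(subset)

    without_low = ordered[k:]               # lower quantile taken away
    without_high = ordered[:-k]             # upper quantile taken away
    return max((without_low, without_high), key=mean_target)
```

Calling `peel_once` repeatedly shrinks the subgroup by only a small share of its objects per step, in contrast to adding a full conjunct.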



6 Navigation and Visualization 

The kind and extent of user involvement in a data mining step varies
depending on the application and user preferences. Subgroup mining systems differ
in the degree of autonomy that is incorporated in the system by the
parameterization of decision processes and the treatment of trade-offs between evaluation
aspects. Involving the user interactively and iteratively in the mining process is
often necessary to ensure that the mining results best serve the particular data 
analytic goals. Then a user centered search is incrementally scheduled, where the 
user assesses individual results by judging various trade-offs and compiles more 
or less manually a consolidated set of subgroups. These judgments on trade-offs 
are best made by a user if they are highly application dependent. Visualizations
can help to compare diverse alternatives so that the user can select the most ap- 
propriate ones for the current application. This incremental search is directed by 
intermediate results and by operations of the user on interactive visualizations. 

User directed incremental and iterative search can be supported by naviga- 
tion operators to specify search processes that are run in subspaces of a multi- 
dimensional hypothesis space, compare their results, and redefine search tasks. 
Visualization of search results is important for these navigation operators. The 
analyst should be able to operate on the presented results to perform comparison,
focusing, explanation, browsing and scheduling operations.

When relying on visualization approaches only, the user must identify the
patterns (regional clusters, concurrencies of lines, emergent groups of associa- 
tions) in the presentations. Because of well-established visual capabilities, it is 
of course much easier for an analyst to detect these patterns in the presented 
visualizations than in the numerical raw data. Data mining methods detect these 
patterns more autonomously, e.g. by searching and evaluating clusters of neigh- 
boring regions. A statistical test ensures that such a cluster is not a random 
result. Although the eye is quite efficient in detecting regularities, visual
inspection alone is not reliable. Often the user sees patterns in the visualizations
that are not really statistically valid, or ignores existing patterns. Therefore a 
combination of data mining and visualization approaches is important. 

Presentation issues deal with four aspects: how can a single pattern instance 
be appropriately visualized, how can a set of interconnected patterns be visu- 
alized, which interactive operations can be performed on a presentation graph, 
and which additional visualizations are important to explain the results, compare 
trade-offs and support an explorative analysis. 

Patterns and sets of patterns must be presented in textual and graphical form 
to the user. A set of patterns can often be represented as a graph referring to the 
partial ordering of the subgroups (ordered by generality). Associations between 
subgroups imply a graph structure as well. Various operations on these graphs 
allow the user of a mining system to redirect a mining task, to filter or group 
mining results, and to browse into the data base. Thus these graphs provide an 
interaction medium for the user based on interactive visualization techniques. 

Besides the interactivity of operations on the visualized subgroup results, 
the pattern specific presentation of a single instance has to be designed. Text 
presentations of subgroups can be simply arranged with presentation templates 
(compare Explora; Klösgen, 1992). Additionally, appropriate graphical
presentations of subgroups and their deviation figures must be designed. For example, a
simple share pattern (binary dependent variable, one population) can be graph- 
ically represented as a fourfold display including the confidence intervals of the 
share. In case of a nominal dependent variable, the set of percentages could be 
represented as a pie chart. However, even the use of pie charts to
illustrate a single frequency distribution is much debated among visualization
experts, because of the limited capacity of humans to compare a set of
angles. Using many pie charts to compare several subgroups and their frequency 
distributions for the values of a nominal variable is even more doubtful. 

Additional visualizations explaining mining results support the analyst in 
selecting subgroups by assessing the trade-off between generality and strength. 
Friedman and Fisher (1997) propose a trajectory visualization of subgroups in 
a two dimensional generality vs. strength space. Another example relates to
the multicollinearity problem: a frequency distribution of the values of an input
variable for a subgroup can help to identify the correlations between subgroups. 
Other visualizations can uncover the overlapping degree between subgroups and 
explain a suppression refinement. Friedman and Fisher (1997) also propose
sensitivity plots that can be used to judge the sensitivity of the hypothesis (subgroup)
quality to the values of the subgroup description variables. With these plots, the
overfitting problem is addressed. 



7 Analysis of Change and of Time Stamped Data 

Three measurements of change can be distinguished: individual, absolute, and 
relative change. When the data base includes the same objects for different time 
points (panel), individual change can be studied. Partially this can be reduced 
to the one cross section case by simply deriving variables that have the difference 
between the two time points as values. However when analyzing subgroups, a 
three dimensional cross table is studied (time × dependent variable × subgroup
& complementary group). Thus more analysis options are available compared
to the two dimensional cross tabulations used for the one cross section case. In
the following, we consider the case in which the cross sections are independent, i.e.
include different objects. 

We first consider two independent cross sections, e.g. samples for two time 
points. Since the two cross sections do not include the same objects, change 
cannot be analyzed on an individual level. For panel data (same objects), more
elaborate analysis methods can be used. Again we refer to one (or several)
dependent variables for which changes in distribution are to be found. Two
approaches are possible. The first one finds subgroups, for which the distribu- 
tions of the two time points are different (absolute change). A relative approach 
compares the differences of the distributions for the subgroup with a reference 
group (e.g. the whole population or the complementary subgroup). Different 
tests depend on the types of the dependent variable. 

For a binary dependent variable we want to find subgroups, for which the 
shares, i.e. the Bernoulli parameter P(Y = 1), are different for the two cross sections.
Under the null hypothesis of equal shares, the quality function based on z-scores
(difference of shares divided by an estimate of its standard error; pooled estimator
for variance) is asymptotically N(0, 1) distributed. Subgroups are selected as
statistically interesting for which this test statistic rejects the null hypothesis. 
With this approach, we refer to absolute change, i.e. we look for subgroups that 
have changed. Additionally, relative change can be analyzed. Then we relate the 
change of a subgroup to the change of a reference group, e.g. the complementary 
group. 
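
The z-score quality for absolute change can be sketched as a standard two-sample test for proportions with a pooled variance estimate. The function name and the argument layout (numbers of positive cases and sample sizes of the subgroup in the two cross sections) are assumptions chosen for the illustration.

```python
import math


def share_change_z(k1, n1, k2, n2):
    """z-score for the change of a share between two independent samples.

    k1, k2: subgroup members with Y = 1 at time points 1 and 2.
    n1, n2: subgroup sizes at time points 1 and 2.
    """
    p1 = k1 / n1
    p2 = k2 / n2
    pooled = (k1 + k2) / (n1 + n2)          # pooled share under H0: p1 = p2
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / standard_error
```

For example, a subgroup whose share rises from 40 of 100 to 60 of 100 yields a z-score of about 2.83, so the null hypothesis of equal shares would be rejected at the usual 5% level.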

For relative change, we use as quality a z-score based test statistic that is 
N(0, 1) distributed under the null hypothesis of equal change in the subgroup
and its complementary subgroup. We can also regard a fixed change (measured 
in the root group) and compare the change in the subgroup with this fixed 
change. Additionally, we can rely on confidence intervals. For the subgroup and 
its parent group, resp. its complementary group, confidence intervals for change 
(difference of shares) are computed and checked for overlapping. 
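
One way to obtain such a z-score for relative change is to compare the two differences of shares and to estimate the variance of their difference by the sum of the four binomial variances. The source does not spell out the estimator, so the unpooled variance used here, like the function name and argument layout, is an assumption.

```python
import math


def relative_change_z(sub1, sub2, comp1, comp2):
    """z-score comparing the change in a subgroup with the change in its
    complementary group. Each argument is a pair (positives, size);
    index 1 and 2 denote the two time points."""

    def share_and_variance(sample):
        positives, size = sample
        p = positives / size
        return p, p * (1 - p) / size

    (ps1, vs1), (ps2, vs2) = share_and_variance(sub1), share_and_variance(sub2)
    (pc1, vc1), (pc2, vc2) = share_and_variance(comp1), share_and_variance(comp2)
    change_subgroup = ps2 - ps1
    change_complement = pc2 - pc1
    # Independent samples: variances of the four estimated shares add up.
    standard_error = math.sqrt(vs1 + vs2 + vc1 + vc2)
    return (change_subgroup - change_complement) / standard_error
```

A value near zero indicates that the subgroup has changed in the same way as its complementary group, even if both changed strongly in absolute terms.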

Various other tests (and quality functions) are possible. For analyzing absolute
change in a subgroup, we can apply a chi-square test of homogeneity for
the cross table (Y × Time) and get a quality that is approximately chi-square
distributed. A further quality is based on the odds values and the computation
of z-scores for the log odds ratio. The odds values for the two time points are
compared (odds ratio). An odds value is given by the quotient of the probabilities
of positive and negative values of the binary dependent variable. Finally,
we analyze for a subgroup the relative difference of the shares for the two time
points, e.g. (p2 − p1)/(1 − p1). Under the null hypothesis of no change, this relative
difference is asymptotically normally distributed. The z-score is calculated
as usual based on an estimate of the standard error.
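
For the odds-based quality, the computation can be written out as follows. Here a_t and b_t denote the numbers of subgroup members with positive and negative values of the dependent variable at time point t; this notation and the standard error, which is the usual large-sample estimate for a log odds ratio, are assumptions, since the text does not state the estimator.

```latex
\[
\mathrm{odds}_t = \frac{p_t}{1 - p_t} = \frac{a_t}{b_t}, \qquad
\widehat{\mathrm{OR}} = \frac{\mathrm{odds}_2}{\mathrm{odds}_1}, \qquad
z = \frac{\ln \widehat{\mathrm{OR}}}
         {\sqrt{\dfrac{1}{a_1} + \dfrac{1}{b_1} + \dfrac{1}{a_2} + \dfrac{1}{b_2}}}
\]
```

Under the null hypothesis of no change the odds ratio equals 1, so its logarithm is 0 and z is approximately N(0, 1) distributed for large subgroups.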

For ordered dependent variables and analyzing absolute change in a sub- 
group, we refer to ridits to make a statement on the probability of a larger 
value of the ordered dependent variable for the second time point. Quality func- 
tions can be defined similarly as above. In case of continuous variables, methods 
based on median tests, order statistics, t-tests, and analysis of variance are used. 
For very skewed distributions, the mean-based approaches may cause problems. 
However, large samples (subgroups) allow an approximation with a normal dis- 
tribution. 

We finally consider the case of more than two independent cross sections, e.g. 
samples for several time points. Some basic questions are: Is there a variability 
in time? Is there a positive trend in time? A possible approach is first to find
subgroups with variability in time, and then to elaborate this pattern by
subsequent, more specialized analyses for trends, etc. Variability of the shares for
the m time points is identified with a chi-square test. Special tests are used
to derive quality functions that measure the degree of an increase (e.g. gradient
test), or of a positive trend. For ordered variables, ridits are analyzed with more
complex statistical methods to deal with this case. For instance, a quality function
can be defined that measures the degree of a monotonic trend for ridits.
More complex methods apply GLIMs to ridits.

Another type of time dependent data is given when each object has a sep- 
arate time reference, e.g. a time stamp. In this case, the time attribute is not 
categorical with only a few values, but continuous. A typical example is a data 
base with error or action logs. For instance, Web log-files can be preprocessed 
to three relations on users, sessions, and actions. Each action has a time stamp. 
Multi-relational subgroup mining methods can be run on these data to identify 
deviating subgroups of users, sessions, or actions. 

In the context of time-stamped data, we finally refer briefly to the second
general subgroup mining pattern directed to pairs of subgroups. The identifica- 
tion of interesting pairs of action subgroups is based on an association measure. 
See (Feldman et al., 1997) for a comparison of several association measures.
Whereas these association measures rely on a two dimensional cross table (num- 
ber of objects in the intersection of the two subgroups and their complementary 
parts), for time stamped data another association option is important to identify 
rules: 



if subgroup A then subgroup B (p%, within n time units) 

This rule states that p% of the actions in subgroup A are succeeded by at least
one action in subgroup B in the same session. For instance: 87% of upload actions
submitted in long sessions by experienced users are succeeded by a comment
action within an average time of 123 seconds. Several types of "successor" definitions
are possible, e.g. immediate successors with no other actions in between.
The time information can relate to given fixed time windows, or express an 
overall measure for the actions such as the mean successor time. 
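
The share p of such a rule with a fixed time window can be computed along the following lines. The log layout as (session, time stamp, action) tuples, the predicate arguments and the function name are assumptions for this sketch; the "successor" definition used is "any later action of the same session within the window".

```python
from collections import defaultdict


def successor_share(actions, in_a, in_b, window):
    """Share of A-actions followed, in the same session and within `window`
    time units, by at least one B-action.

    actions: iterable of (session_id, timestamp, action) tuples.
    in_a, in_b: predicates deciding subgroup membership of an action.
    """
    sessions = defaultdict(list)
    for session_id, timestamp, action in actions:
        sessions[session_id].append((timestamp, action))

    n_a = 0
    n_followed = 0
    for events in sessions.values():
        events.sort(key=lambda event: event[0])      # order by time stamp
        for i, (t, action) in enumerate(events):
            if not in_a(action):
                continue
            n_a += 1
            if any(in_b(later) and t_later - t <= window
                   for t_later, later in events[i + 1:]):
                n_followed += 1
    return n_followed / n_a if n_a else 0.0
```

Restricting the inner search to `events[i + 1:i + 2]` would give the "immediate successor" variant mentioned above.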

8 Spatial Mining 

Spatial Mining is necessary when each object has a space reference captured 
within a spatial attribute. Several types of spatial references can be distin- 
guished. The objects can directly represent spatial entities such as points, lines 
or areas, or they can indirectly be related to such entities such as persons living 
in areas. Specifically we deal now with the latter case where many objects are 
related to each spatial entity and the problem of how to use the spatial attribute 
for a subgroup description. For spatial attributes with a few nominal, ordered 
or equidistant values, the k-population patterns (Section 4) can be applied to
compare k areas. We now assume that the spatial attribute represents many
non-overlapping, contiguous areas defined as polygons that cover a whole region 
(e.g. a country). 

The background knowledge on polygons can be exploited to construct tax- 
onomies for the spatial attribute. An admissible value subset Vi of the spatial 
attribute can be defined by some conditions on the neighborhood structure of 
regional clusters given by a subset of neighboring values. Such a geographical
taxonomy finder is implemented in the Data Surveyor system. 

To construct regional clusters, a bottom up, supervised and local subsearch 
for value subsets of the spatial attribute is scheduled, conditional on the
outcome of a global spatial autocorrelation test. When the standard statistic for
testing the independence of the target variable across areas (Moran's I) indicates a
departure from independent observations, the clustering subsearch is run; see 
(Gebhardt, 1997) for the theoretical foundations of this solution that overcomes 
the combinatorial explosion and randomness of regional clusters. 
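
The global statistic itself is easy to compute. The sketch below uses binary contiguity weights (1 for neighboring areas, 0 otherwise); the dictionary-based input format and the function name are assumptions, and the significance test that would decide whether the clustering subsearch is run is not included.

```python
def morans_i(values, neighbors):
    """Moran's I with binary neighborhood weights.

    values: dict mapping each area to the value of the target variable.
    neighbors: dict mapping each area to the set of its neighboring areas
               (assumed symmetric).
    """
    n = len(values)
    mean = sum(values.values()) / n
    deviation = {area: value - mean for area, value in values.items()}
    weight_sum = sum(len(adjacent) for adjacent in neighbors.values())
    cross_products = sum(deviation[a] * deviation[b]
                         for a, adjacent in neighbors.items()
                         for b in adjacent)
    squared = sum(d * d for d in deviation.values())
    return (n / weight_sum) * cross_products / squared
```

Values clearly above the expectation under independence, which is -1/(n - 1), indicate that neighboring areas tend to have similar target values; negative values indicate alternating patterns.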

The separate clustering search is necessary, because the global test does not 
tell where and in which direction (positive or negative) the spatial deviations 
occur. The merge operation used in the bottom-up search relies on merging 
compact triplets of neighboring areas in each recursive search step. These compact
triplets are derived in a preprocessing GIS operation using the polygons. The 
polygons are further used for geographic visualizations of the resulting clusters 
within a geographical map showing the spatial deviations of the target variable 
distribution. 

9 Text Mining 

Document Explorer (Feldman et al., 1997) is a system searching for patterns in
document collections based on subgroup mining. Patterns relate to subgroups of 
documents and pairs of subgroups (Section 2). A subgroup is described by a set
of terms or concepts (concept set). Each document belonging to the subgroup has
to include all concepts of the concept set. These patterns provide knowledge on 
the application domain that is represented by the collection. A pattern can also 
be seen as a query or implying a query that, when addressed to the collection, 
retrieves a set of documents. Thus the data mining tools also identify interesting 
queries which can be used to browse the collection. The system searches for 
interesting concept sets and associations between concept sets, using explicit 
bias for capturing interestingness. The bias is provided by the user specifying 
syntactical, background, quality and redundancy constraints to direct the search 
in the vast implicit spaces of pattern instances which exist in the collection. The 
patterns which have been verified as interesting are structured and presented in 
a visual user interface allowing the user to operate on the results to refine and 
redirect search tasks or to access the associated documents. The system offers 
preprocessing tools to construct or refine a knowledge base of domain concepts 
and to create an internal representation of the collection which will be used 
by all subsequent mining operations. The source documents can be of text or 
multimedia type and may be distributed, e.g. over the Internet or an intranet.

A knowledge base includes domain knowledge about the document area. It 
includes a concept DAG (directed acyclic graph) of the relevant concepts for
the domain. Several categories of concepts are hierarchically arranged in this
DAG. For the application domain of the Reuters newswire collection, for example, categories
correspond to countries, persons, topics, etc. with subcategories like European 
Union, politicians, economic indicators. Additionally, the knowledge base con- 
tains background relations. These are binary relations between categories such 
as nationality (relation between persons and countries) or export partners (be- 
tween countries). In preprocessing, the knowledge base and a target database 
are constructed. The target database contains binary tuples. A tuple represents 
a document and the concepts relevant to the document. All data mining
operations in Document Explorer operate on a derived trie structure,
which is an efficient data structure to manage all aggregates existing in the target
database. 

A concept set is simply a set of concepts. A set of concepts can be seen as an 
intermediate concept that is given by the conjunction of the concepts of the set. 
E.g. the concepts "data mining" and "text analysis" define a joint concept which
can be interpreted as "data mining in text data". Frequent concept sets are sets
of concepts with a minimal support, i.e. all the concepts of the set must appear
together in at least s documents. A context is given by a concept set and is used 
as a subselection of the document collection. Then only the documents in this 
subcollection are analyzed in a search task. The system derives, for example, 
patterns "in the context of crude oil" for the documents that contain crude oil
as a phrase or are annotated by crude oil using text categorization algorithms. 
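
Frequent concept sets can be enumerated level by level. The sketch below counts support directly on documents given as sets of concepts; it does not use the trie structure of Document Explorer, and the function name and the `max_size` bound are assumptions.

```python
from itertools import combinations


def frequent_concept_sets(documents, min_support, max_size=3):
    """Return {concept set: support} for all concept sets of up to
    `max_size` concepts that appear together in at least `min_support`
    documents. documents: list of sets of concepts."""
    concepts = sorted(set().union(*documents))
    frequent = {}
    for size in range(1, max_size + 1):
        found_any = False
        for combo in combinations(concepts, size):
            concept_set = frozenset(combo)
            support = sum(1 for doc in documents if concept_set <= doc)
            if support >= min_support:
                frequent[concept_set] = support
                found_any = True
        if not found_any:
            break    # no frequent set of this size, so none of larger size
    return frequent
```

A context would be applied by first filtering `documents` to those containing the context concept set and then running the same search on the subcollection.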

A binary relation between concept sets is a subset of the cross product of the
set of all concept sets with itself. An association is a binary relation given by a similarity
function. To measure the degree of connection (similarity) between two sets of 
concepts, we usually rely on the support of the documents in the collection that
include all the concepts of the two sets. If there is no document that contains
all the concepts, then the two concept sets will have no connection (similarity 
= 0). If all the concepts of the two sets always appear together, the strongest 
connection measurable by the document collection (similarity = 1) is given. An 
association rule is a special association, defined as usual by a minimal support 
and confidence. Furthermore, the similarity of two concept sets relative to a 
category can be measured by comparing the conditional distributions of the 
concepts of the category with respect to the two concept sets. 
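
One support-based function with the two boundary properties described above is the ratio of documents containing both concept sets to documents containing at least one of them. The text does not say which function Document Explorer uses, so this particular choice and the function name are assumptions made for illustration.

```python
def similarity(documents, concepts_a, concepts_b):
    """Support-based similarity of two concept sets in [0, 1].

    documents: list of sets of concepts.
    Returns 0 if no document contains all concepts of both sets and 1 if
    the two concept sets always appear together.
    """
    a = frozenset(concepts_a)
    b = frozenset(concepts_b)
    both = sum(1 for doc in documents if a <= doc and b <= doc)
    either = sum(1 for doc in documents if a <= doc or b <= doc)
    return both / either if either else 0.0
```

An association in the sense above would then consist of all pairs of concept sets whose similarity exceeds a chosen threshold.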

A keyword graph is a pair consisting of a set of nodes and a set of edges. 
Each edge connects two nodes. Quality measures are calculated for each node 
and each edge. A node corresponds to a concept set and an edge to an element of 
a binary relation. Special subsets of nodes and connections can be defined, e.g. a 
clique is a subset of nodes of a keyword graph, for which all pairs of its elements 
are connected by an edge. A path connects two nodes of a keyword graph by a 
chain of connected nodes. 
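
The clique condition on a keyword graph can be checked directly from the definition. In this sketch nodes are represented by arbitrary hashable labels standing for concept sets and edges by unordered pairs; both representations and the function name are assumptions.

```python
from itertools import combinations


def is_clique(nodes, edges):
    """True if every pair of distinct nodes is connected by an edge.

    nodes: iterable of node labels.
    edges: iterable of 2-element pairs of node labels (undirected).
    """
    edge_set = {frozenset(edge) for edge in edges}
    return all(frozenset(pair) in edge_set
               for pair in combinations(nodes, 2))
```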

A search task is specified in Document Explorer by syntactical, background, 
quality and redundancy constraints for searching spaces of concept sets or of 
associations. The result of a search task is a group of concept sets or associa- 
tions satisfying the specified constraints. These groups of results are arranged 
in keyword graphs offering to the user interactive operations on the nodes and 
edges of the graph. 



10 Conclusion 

Subgroup mining is a pragmatic data exploration approach that can be ap- 
plied for various data analytic questions. Although subgroup mining has already 
reached a quite impressive development status, it is an evolving area for which a 
lot of problems must still be solved. These problems mainly relate to statistical 
validity of subgroup mining results, robustness of results, advanced description 
languages for time and space related variables, evaluating sets of subgroups, non- 
statistical interestingness evaluations, second order mining to combine subgroups 
found for different pattern types, and scalability to very large data bases.


In a first step, roughness and fuzziness fail to account for the type of graduality
(vagueness) involved in the concept of a heap, as it is conceived in the
famous Eubulides' paradox. One can partially bridge this gap by means of tolerance
rough sets. Even in this case, a non-concordance persists between the
empirical finiteness and the theoretical infinity of a heap. Another way to ap- 
proach this problem could be via negligibility (be it cardinal, measure-theoretic 
or topological). 

The paradox of the heap of grains due to Eubulides (a pupil of Euclid of Megara) consists
in the impossibility of answering the question: What is the smallest number of
grains making a heap of grains? If such a number n exists, then n − 1 grains no
longer form a heap, in contradiction with the empirical fact that a heap remains
a heap when only one grain is removed from it. Correspondingly, there is
another empirical fact, according to which a non-heap cannot become a heap by
adding to it only one grain. This is a particular way to assert that the switch 
from non-heap to heap is gradual. The concept of a heap does not have a sharp 
boundary. Eubulides’ paradox is due to the fact that the question looking for an 
answer is based on a wrong presupposition: the concept of a heap has a sharp 
boundary. 

Formally, given a universal set U, its subsets are of two kinds: heaps and
non-heaps. These two classes are not defined intrinsically, but by their behavior 
in respect to some operations. As a matter of fact, it is enough to define one of 
them and the definition of the other follows immediately. 

A heap is a non-empty subset X of U such that for any x ∈ X the set X − {x}
is still a heap. A subset Y of U is said to be a non-heap if it is not a heap.

Proposition 1. Given a non-heap Y and an element x in U − Y, the union Y'
of Y and {x} is still a non-heap.

Proof. Accepting, by contradiction, that Y' is a heap, it follows that Y' − {x} =
Y is still a heap, in conflict with the hypothesis.

Proposition 2. Any heap is an infinite subset of U.

Proof. Let us suppose the existence of a finite heap X. There exists a natural
number n and n elements x(1), x(2), ..., x(n) in U such that X = {x(1), x(2), ...,
x(n)}. Applying n − 1 times the definition of a heap, we infer that the sets



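\begin{align}
X(1) &= X - \{x(1)\} \tag{1}\\
X(2) &= X(1) - \{x(2)\} \tag{2}\\
&\;\;\vdots \tag{3}\\
X(n-1) &= X(n-2) - \{x(n-1)\} \tag{4}
\end{align}

are all heaps. But we have

\begin{align}
X(1) &= \{x(2), x(3), \ldots, x(n)\} \tag{5}\\
X(2) &= \{x(3), \ldots, x(n)\} \tag{6}\\
&\;\;\vdots \tag{7}\\
X(n-1) &= \{x(n)\} \tag{8}
\end{align}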



Applying to X(n − 1) the definition of a heap, the set X(n) = X(n − 1) − {x(n)}
should still be a heap, in contradiction with the fact that X(n) is empty.

Corollary 3. Any finite subset of U is a non-heap. 



Corollary 4. A necessary condition for the existence of a heap is the infinity of
U.

Proposition 5. Any infinite subset of U is a heap. 

Proof. Obvious, because any infinite set remains infinite when one element is 
eliminated from it. 

Proposition 6. A subset of U is a heap if and only if it is infinite. 

Proof. Follows from Propositions 2 and 5. 

Corollary 7. A subset of U is a non-heap if and only if it is finite. 

Proof. Follows from Proposition 6. 

Proposition 6 and Corollary 7 contradict the empirical-intuitive base of the 
concept of a heap; empirically, a heap is finite, although very large. So far, we 
have no therapy for this illness. 

Given a set Z contained in U, let us try to approximate Z by means of
various equivalence relations between subsets of U. Define r as a binary relation
between subsets of U, such that two subsets are in relation r if they are either
both finite or both infinite. Obviously, r is an equivalence relation.

Following Pawlak [4], a subset of U is said to be totally e-unobservable,
where e is an equivalence relation in the set P(U) of all subsets of U, if its
e-lower approximation is empty, while its e-upper approximation is U.

Proposition 8. Every proper subset of U is totally r-unobservable.

Proof. Excepting the trivial case when Z = U, no r-equivalence class is such
that the union of its sets is contained in Z; so, the r-lower approximation of Z
is the empty set. On the other hand, the r-upper approximation of Z is given by
U, because among the infinite subsets of U which meet Z is U itself. Since there
are only two r-equivalence classes, we reach a situation of total r-unobservability.

We will consider now the binary relation p defined as follows: two subsets A 
and B of U are in relation p if their symmetric difference is finite. 

Proposition 9. The binary relation p is an equivalence relation in P(U).

Proof. The relation p is obviously reflexive and symmetric. Let A p B and
B p C. It follows that A − B, B − A, B − C, C − B are all finite sets. The set
A − C is the union of a part contained in B, i.e., in B − C, which is finite, and
a part disjoint from B, so contained in A − B, which is again finite; it follows
that A − C is finite. The set C − A is the union of a part contained in B, i.e., in
B − A, which is finite, and a part disjoint from B, so contained in C − B, which
is also finite; it follows that C − A is finite. Hence the symmetric difference of A
and C is finite and we have A p C, which proves the transitivity of p.
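
The transitivity argument can be condensed into two inclusions. The symbol \triangle for the symmetric difference is notation introduced here for brevity and is not used in the text.

```latex
\[
A - C \subseteq (A - B) \cup (B - C), \qquad
C - A \subseteq (C - B) \cup (B - A),
\]
\[
\text{hence}\quad A \,\triangle\, C \subseteq (A \,\triangle\, B) \cup (B \,\triangle\, C).
\]
```

Since a subset of a union of two finite sets is finite, finiteness of the symmetric differences of A, B and of B, C implies finiteness of the symmetric difference of A, C.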

Proposition 10. If A is a heap and B is p-equivalent to A, then B is still a 
heap. 

Proof. Since A is a heap, it follows from Proposition 6 that A is infinite. Since
A − B is finite, it follows that the common part of A and B is infinite, so B is
infinite and again, in view of Proposition 6, B is a heap.

Proposition 11. There may exist two heaps that are not p-equivalent.

Proof. If U is the set of real numbers, the set of rationals and the set of irra- 
tionals are both heaps, although they are disjoint. 

Proposition 12. Any set Z strictly contained in U is totally p-unobservable.

Proof. Since all finite subsets of U belong to the same p-equivalence class and
U − Z is not empty, it follows that the p-lower approximation of Z is empty.
On the other hand, the union of subsets p-equivalent to U intersects Z, because
Z is contained in U, so the p-upper approximation of Z is U and Z is totally
p-unobservable.

Another possibility is to introduce the similarity relation s; two subsets A 
and B of U are in relation s if one of the following conditions is satisfied: 

1. A = B;

2. there is an element a in U such that either A − {a} = B or B − {a} = A.

The relation s is reflexive and symmetric in P(U), but not transitive. It is
a tolerance relation in P(U). We associate to each subset X of U its tolerance
class s(X), i.e., the union of all subsets Y of U such that X s Y. Given a set Z
contained in U, let us call the s-lower approximation of Z the union m(Z, s) of all
s(X) contained in Z; let us call the s-upper approximation of Z the union u(Z, s)
of all s(X) intersecting Z. For details of this approach, see Marcus [1], where
it is shown how an infinite sequence of approximations of Z can be obtained.
See also, in this respect, Nieminen [3], Polkowski et al. [5] and Pomykała [6]. In
respect to the paradox of the heap, the relevant case for a tolerance approach is
obtained when U is a countably infinite set. The corresponding infinite sequence
of approximations is represented by the successive finite sections. They can be
organized as a Čech topology [1].

A fuzzy approach to the concept of a heap could start in the following way 
(in view of the already obtained results). We identify a heap with a mapping f
from P(U) into [0, 1], where for any finite subset X of U we put f(X) = 0 and
for any infinite subset X of U we put f(X) = 1 (in view of Proposition 6). In
this way, we get the special case of a crisp set, in conflict with the empirical
vagueness (graduality) of a heap. The only way to avoid this failure is to follow
an itinerary parallel to the tolerance rough set approach. For instance, let us put
f(∅) = 0, f(X) = n/(n + 1) when X is a finite subset of U of cardinal n and
f(X) = 1 when X is an infinite subset of U.
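
The graded membership just described can be sketched in a few lines of Python. This is only an illustration: the function name heap_degree and the convention of passing math.inf for an infinite set are assumptions made here, not part of the text.

```python
import math
from fractions import Fraction


def heap_degree(n):
    """Degree to which a set of cardinality n is a heap.

    n is a non-negative integer for a finite set, or math.inf for an
    infinite set (an encoding assumed for this sketch only).
    """
    if n == math.inf:
        return Fraction(1)
    if n < 0:
        raise ValueError("cardinality must be non-negative")
    # f(X) = n / (n + 1); the empty set (n = 0) gets degree 0.
    return Fraction(n, n + 1)


degrees = [heap_degree(n) for n in (0, 1, 2, 9)]
print(degrees)
```

The degree grows strictly with the number of elements and stays below 1 for every finite set, which is the gradual behaviour the crisp 0/1 assignment lacks.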

Another fuzzy approach could follow the way proposed by Mareš [2], namely
by means of what he calls a trapezoidal fuzzy quantity and a triangular fuzzy
quantity; maybe in this way we could bridge the gap between the theoretical
and the empirical aspects of our problem, by avoiding work with infinite sets.

The idea of a heap exploits the elementary fact that finite sets are negligible
with respect to infinite sets, in the same sense in which zero is negligible with respect
to the addition of real numbers. We can extend this perspective by defining
cardinally generalized heaps. Any set of transfinite cardinal α is a heap with respect
to all sets of cardinal smaller than α. In a measure-theoretic perspective, sets of
strictly positive measure are heaps with respect to sets of measure zero, while in
a topological perspective sets of second Baire category are heaps with respect to
sets of first Baire category.

The baldness paradox could be investigated following similar methods, but
the main difficulties still remain unsolved.

We discuss philosophical and metamathematical origins of rough sets and their
fundamental properties. We argue that rough sets are necessary in the light of
the Platonic concept of ideal mathematical objects. We show how rough sets
have been present in concept formation, diagnosis, classification and other rea- 
soning tasks. We present examples indicating that the intuitive idea of a rough 
set has been used (under various names) by physicians, engineers and philoso- 
phers as a basic tool to classify and utilize concepts in their respective domains 
of activity. We discuss the differences between rough sets and other approaches 
to incomplete and imprecise information such as fuzzy logic and logics that for- 
malize the process of “jumping to conclusions”. 

In a more formal part of our presentation, we discuss the connection of rough sets
with three- and four-valued logics (relevance logics). We show how equivalence
relations and related notions of rough sets generate three-valued and four-valued
approximations of relational systems. We prove monotonicity results for such
approximations as well as preservation theorems.

We discuss computational tasks associated with rough sets, such as the selection
of a minimal discerning set of attributes and its weighted version. We present
complexity results and some algorithms.

Abstract. The paper contains some considerations concerning
the relationship between decision rules and inference rules
from the rough set theory perspective. It is shown that decision
rules can be interpreted as a generalization of the modus ponens
inference rule; however, there is an essential difference between
these two concepts. Decision rules in the rough set approach
are used to describe dependencies in data, whereas modus ponens
is used in general to derive conclusions from premises.



1 Introduction 

Data analysis, recently known also as data mining, is, no doubt, a very important
and rapidly growing area of research and applications. Historically, data mining
methods were first based on statistics, but it is worth mentioning that their
origin can be traced back to some ideas of Bertrand Russell and Karl Popper
concerning reasoning about data. Recently machine intelligence and machine
learning have contributed essentially to this domain. In particular, fuzzy sets, rough
sets, genetic algorithms, neural networks, cluster analysis and other branches
of AI can nowadays be considered basic tools for knowledge discovery in databases.

The main objective of data analysis is finding hidden patterns in data. More
specifically, data analysis is about searching for dependencies, or in other words,
pursuing "cause-effect" relations, in data.

From a logical point of view, data analysis can be perceived as a part of inductive
reasoning, and therefore it can be understood as a kind of method of reasoning
about data, with specific inference tools.

Reasoning methods are usually classified into three classes: deductive, induc- 
tive and common sense reasoning. 

Deductive methods are based on axioms and deduction rules, inductive rea- 
soning hinges on data and induction rules, whereas common sense reasoning 
is based on common knowledge and common sense evident inferences from the 
knowledge. 

Deductive methods are used exclusively in mathematics, inductive methods
in natural sciences, e.g., physics, chemistry, etc., while common sense reasoning
is used in human sciences, e.g., politics, medicine, economy, etc., but mainly this
method is used almost everywhere in everyday life debates, discussions and
polemics.

This paper shows that the rough set approach to data analysis somehow bridges
the deductive and inductive approaches in reasoning about data. Rough
set reasoning is also, to some extent, related to common sense reasoning.

Rough set theory gave rise to extensive research in deductive logic, and var- 
ious logical systems, called rough logics, have been proposed and investigated 
(see e.g., [3, 6, 7, 10, 11, 18, 20, 23]). However, the basic idea of rough set based
reasoning about data is of inductive rather than deductive character. Particularly
interesting in this context is the relationship between an implication in
deductive logic and a decision rule in the rough set approach.

In deductive logic the basic rule of inference, modus ponens (MP), is based on
implication, which can be seen as a counterpart of a decision rule in decision rule
based methods of data analysis. Although decision rules used in the
rough set approach are formally similar to the MP rule of inference, they play a
different role from that of the MP inference rule in logical reasoning. Deduction rules
are used to derive true consequences from true premises (axioms), whereas decision rules
are descriptions of total or partial dependencies in databases. Besides, in inductive
reasoning optimization of decision rules is of essential importance, but in
deductive logic we do not need to care about optimization of implications used
in reasoning. Hence, implications and decision rules, although formally similar,
are totally different concepts and play different roles in the two kinds of reasoning
methods. Moreover, decision rules can also be understood as exact or approximate
descriptions of decisions in terms of conditions.

It is also interesting to note a relationship between rough set based reasoning
and common sense reasoning methods. Common sense reasoning usually starts
from common knowledge shared by domain experts. In rough set based reasoning
the common knowledge is not assumed but derived from data about the
domain of interest. Thus the rough set approach can also be seen as a new approach
to (common) knowledge acquisition. Also, the common rules of inference
can be understood in our approach as data explanation methods. Note that
qualitative reasoning, a part of common sense reasoning, can also be explained in
the rough set philosophy.

Summing up, rough set based reasoning overlaps with deductive, inductive
and common sense reasoning; however, it has its own specific features
and can be considered in its own right.



2 Data, Information Systems and Decision Tables 

The starting point of rough set theory is a set of data (information) about some
objects of interest. Data are usually organized in the form of a table called an
information system or information table.
A very simple, fictitious example of an information table is shown in Table
1. The table describes six cars in terms of their features (attributes) such as fuel
consumption (F), perceived quality (Q), selling price (P) and marketability (M).



Table 1. An example of an information system

Car   F         Q      P      M
1     high      fair   med.   poor
2     v. high   good   med.   poor
3     high      good   low    poor
4     med.      fair   med.   good
5     v. high   fair   low    poor
6     high      good   low    good


Our main problem can be characterized as determining the nature of the 
relationship between selected features of the cars and their marketability. In 
particular, we would like to identify the main factors affecting the market ac- 
ceptance of the cars. 

Information systems with distinguished decision and condition attributes are 
called decision tables. 

Each row of a decision table determines a decision rule, which specifies decisions
(actions) that should be taken when conditions pointed out by condition
attributes are satisfied. For example, in Table 1 the condition (F, high), (Q, fair),
(P, med.) determines uniquely the decision (M, poor). Decision rules 3) and 6) in
Table 1 have the same conditions but different decisions. Such rules are called
inconsistent (nondeterministic, conflicting, possible); otherwise the rules are
referred to as consistent (certain, deterministic, nonconflicting, sure). Decision
tables containing inconsistent decision rules are called inconsistent
(nondeterministic, etc.); otherwise the table is consistent (deterministic, etc.).

The ratio of the number of consistent rules to the number of all rules in a decision
table can be used as the consistency factor of the decision table, and will be denoted
by γ(C, D), where C and D are condition and decision attributes, respectively. Thus
if γ(C, D) = 1 the decision table is consistent and if γ(C, D) ≠ 1 the decision table
is inconsistent. For example, for Table 1, γ(C, D) = 4/6.
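
As a rough illustration, the consistency factor of Table 1 can be computed with the Python sketch below. The dictionary encoding of the table and the names TABLE, CONDITIONS, DECISION and consistency_factor are assumptions introduced here for the example; they are not notation from the paper.

```python
from collections import defaultdict
from fractions import Fraction

# Table 1, keyed by car number (encoding assumed for this sketch).
TABLE = {
    1: {"F": "high", "Q": "fair", "P": "med.", "M": "poor"},
    2: {"F": "v. high", "Q": "good", "P": "med.", "M": "poor"},
    3: {"F": "high", "Q": "good", "P": "low", "M": "poor"},
    4: {"F": "med.", "Q": "fair", "P": "med.", "M": "good"},
    5: {"F": "v. high", "Q": "fair", "P": "low", "M": "poor"},
    6: {"F": "high", "Q": "good", "P": "low", "M": "good"},
}
CONDITIONS = ("F", "Q", "P")
DECISION = "M"


def consistency_factor(table, conditions, decision):
    """Ratio of consistent rules (rows) to all rules in the table."""
    # Collect every decision value seen for each combination of conditions.
    decisions_seen = defaultdict(set)
    for row in table.values():
        decisions_seen[tuple(row[a] for a in conditions)].add(row[decision])
    # A rule is consistent when its conditions lead to exactly one decision.
    consistent = sum(
        1
        for row in table.values()
        if len(decisions_seen[tuple(row[a] for a in conditions)]) == 1
    )
    return Fraction(consistent, len(table))


gamma = consistency_factor(TABLE, CONDITIONS, DECISION)
print(gamma)  # 2/3, i.e. 4/6
```

Rules 3) and 6) share their conditions but differ in the decision, so only four of the six rules count as consistent.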

In what follows information systems will be denoted by S = (U, A), where
U is the universe and A is a set of attributes, such that for every x ∈ U and a ∈ A,
a(x) ∈ V_a, where V_a is the domain (set of values) of a.

3 Decision Rules and Certainty Factor 

Decision rules are often presented as implications and are called "if ... then ..."
rules. For example, Table 1 determines the following set of implications:

1) if (F, high) and (Q, fair) and (P, med.) then (M, poor),
2) if (F, v. high) and (Q, good) and (P, med.) then (M, poor),
3) if (F, high) and (Q, good) and (P, low) then (M, poor),
4) if (F, med.) and (Q, fair) and (P, med.) then (M, good),
5) if (F, v. high) and (Q, fair) and (P, low) then (M, poor),
6) if (F, high) and (Q, good) and (P, low) then (M, good).

In general, decision rules are implications built up from elementary formulas
(attribute name, attribute value) and combined together by means of propositional
connectives "and", "or" and "implication" in the usual way.

Let Φ and Ψ be logical formulas representing conditions and decisions,
respectively, and let Φ → Ψ be a decision rule, where Φ_S denotes the meaning of Φ
in the system S, i.e., the set of all objects satisfying Φ in S, defined in the usual
way.

With every decision rule Φ → Ψ we associate a number, called the certainty
factor of the rule, and defined as

\[ \mu_S(\Phi, \Psi) = \frac{|(\Phi \wedge \Psi)_S|}{|\Phi_S|}, \]

where |X| denotes the cardinality of X. Of course 0 ≤ μ_S(Φ, Ψ) ≤ 1; if the rule
Φ → Ψ is consistent then μ_S(Φ, Ψ) = 1, and for inconsistent rules μ_S(Φ, Ψ) < 1.
For example, the certainty factor for decision rule 2) is 1, and for decision rule
3) is 0.5.

The certainty factor can be interpreted as the conditional probability of the
decision Ψ given the condition Φ.
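
The certainty factor can be checked on Table 1 with a short Python sketch. It assumes that a formula is a conjunction of (attribute, value) pairs, represented here as a dictionary; the names TABLE, meaning and certainty are hypothetical helpers, not taken from the paper.

```python
from fractions import Fraction

# Table 1, keyed by car number (encoding assumed for this sketch).
TABLE = {
    1: {"F": "high", "Q": "fair", "P": "med.", "M": "poor"},
    2: {"F": "v. high", "Q": "good", "P": "med.", "M": "poor"},
    3: {"F": "high", "Q": "good", "P": "low", "M": "poor"},
    4: {"F": "med.", "Q": "fair", "P": "med.", "M": "good"},
    5: {"F": "v. high", "Q": "fair", "P": "low", "M": "poor"},
    6: {"F": "high", "Q": "good", "P": "low", "M": "good"},
}


def meaning(table, formula):
    """Objects satisfying a conjunction of (attribute, value) pairs."""
    return {
        x for x, row in table.items()
        if all(row[a] == v for a, v in formula.items())
    }


def certainty(table, condition, decision):
    """Share of objects satisfying the condition that also satisfy the decision.

    Assumes at least one object satisfies the condition.
    """
    cond = meaning(table, condition)
    both = cond & meaning(table, decision)
    return Fraction(len(both), len(cond))


rule2 = certainty(TABLE, {"F": "v. high", "Q": "good", "P": "med."}, {"M": "poor"})
rule3 = certainty(TABLE, {"F": "high", "Q": "good", "P": "low"}, {"M": "poor"})
print(rule2, rule3)  # 1 1/2
```

Rule 3) gets the value 1/2 because cars 3 and 6 satisfy its condition but only car 3 sells poorly.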

It is worth mentioning that the association of conditional probability with
implication was first proposed by J. Łukasiewicz in the context of multivalued logic
and probabilistic logic [4]. This idea was pursued by other logicians years
later [1]. In rule based knowledge systems many authors have also proposed using
conditional probability to characterize the certainty of a decision rule [2]. In
particular, in the rough set approach the association of conditional probabilities with
decision rules has been pursued, e.g., in [21, 24, 27].

Now the difference between use of implications in classical logic and in data 
analysis can be clearly seen, particularly in the rough set framework. Implication 
in deductive logic is used to draw conclusions from premises, by means of modus 
ponens rule of inference. In reasoning about data implications are decision rules 
used to describe patterns in data. Hence, the role of implications in both cases is 
completely different. Besides, modus ponens is a universal rule of inference valid
in any logical system, but decision rules are strictly associated with specific
data and are not valid universally.
However, in the rough set approach decision rules can also be exploited in a
similar way as modus ponens in logic. Let us consider the following formula:

\[ \pi_S(\Psi) = \sum \bigl( \pi_S(\Phi) \cdot \mu_S(\Phi, \Psi) \bigr) = \sum \pi_S(\Phi \wedge \Psi) \qquad (*) \]

where Σ is taken over all conditions Φ associated with the decision corresponding
to Ψ, and π_S(Φ) = |Φ_S| / |U|.

π_S(Φ) is the probability that the condition Φ is satisfied in S. Thus formula (*)
shows the relationship between the probability of conditions, the certainty factor of
a decision rule and the probability of decisions.

Hence formula (*) allows us to compute the probability that the decision Ψ is
satisfied in S, in terms of the probability of the condition Φ and the conditional
probability of the decision rule Φ → Ψ.
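
Formula (*) can be verified numerically on Table 1. The Python sketch below is an illustration under stated assumptions: it sums over all distinct condition combinations occurring in the table (those incompatible with the decision contribute zero), and the names TABLE, probability and rough_modus_ponens are hypothetical.

```python
from fractions import Fraction

# Table 1, keyed by car number (encoding assumed for this sketch).
TABLE = {
    1: {"F": "high", "Q": "fair", "P": "med.", "M": "poor"},
    2: {"F": "v. high", "Q": "good", "P": "med.", "M": "poor"},
    3: {"F": "high", "Q": "good", "P": "low", "M": "poor"},
    4: {"F": "med.", "Q": "fair", "P": "med.", "M": "good"},
    5: {"F": "v. high", "Q": "fair", "P": "low", "M": "poor"},
    6: {"F": "high", "Q": "good", "P": "low", "M": "good"},
}
CONDITIONS = ("F", "Q", "P")


def probability(table, objects):
    """Probability of a set of objects: its size divided by |U|."""
    return Fraction(len(objects), len(table))


def rough_modus_ponens(table, conditions, decision_attr, decision_value):
    """Right-hand side of (*): sum of pi(condition) * certainty factor."""
    decided = {x for x, row in table.items() if row[decision_attr] == decision_value}
    # Group objects by their combination of condition values.
    classes = {}
    for x, row in table.items():
        classes.setdefault(tuple(row[a] for a in conditions), set()).add(x)
    total = Fraction(0)
    for members in classes.values():
        pi = probability(table, members)
        mu = Fraction(len(members & decided), len(members))
        total += pi * mu
    return total


derived = rough_modus_ponens(TABLE, CONDITIONS, "M", "poor")
direct = probability(TABLE, {x for x, row in TABLE.items() if row["M"] == "poor"})
print(derived, direct)  # 2/3 2/3
```

Both values agree: the probability of the decision (M, poor) is recovered from the probabilities of the conditions and the certainty factors of the rules.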

This structure is analogous to the modus ponens inference rule and
can be treated as its generalization, called rough modus ponens (RMP) [15]. The
certainty factor of a decision rule can be seen as a generalization of the rough
membership function. It can also be understood as a rough inclusion factor in
rough mereology [16, 17] or as a degree of truth of the implication associated
with the inclusion.

4 Approximations of Sets 

The main problem discussed in the previous section can also be formulated as
follows: can we uniquely describe well (poorly) selling cars in terms of their
features? Of course, as before, this question cannot be answered uniquely, since
cars 3 and 6 have the same features but car 3 sells poorly whereas car 6 sells
well; hence we are unable to give a unique description of cars selling well or poorly.

But one can observe that in view of the available information we can state
that cars 1, 2 and 5 surely belong to the set of cars which are selling poorly,
whereas cars 1, 2, 3, 5 and 6 possibly belong to the set of cars selling poorly,
i.e., cannot be excluded as cars selling poorly. Similarly, car 4 surely belongs to
well selling cars, whereas cars 3, 4 and 6 possibly belong to well selling cars.
Hence, because we are unable to give a unique characterization of cars selling well
(poorly), we propose instead to use two sets, called the lower and the upper
approximation of the set of well (poorly) selling cars.

Now, let us formulate the problem more precisely. 

Any subset B of A determines a binary relation I_B on U, which will be
called an indiscernibility relation, and is defined as follows: x I_B y if and only if
a(x) = a(y) for every a ∈ B, where a(x) denotes the value of attribute a for
element x. Obviously I_B is an equivalence relation. The family of all equivalence
classes of I_B, i.e., the partition determined by B, will be denoted by U/I_B, or
simply U/B; an equivalence class of I_B, i.e., the block of the partition U/B
containing x, will be denoted by B(x).

If (x, y) belongs to I_B we will say that x and y are B-indiscernible. Equivalence
classes of the relation I_B (or blocks of the partition U/B) are referred to
as B-elementary concepts or B-granules. As mentioned previously, in the rough
set approach the elementary concepts are the basic building blocks (concepts)
of our knowledge about reality.
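
To make the partition U/B concrete, the following Python sketch groups the cars of Table 1 into blocks of mutually indiscernible objects. The dictionary encoding and the function name partition are assumptions made for this illustration.

```python
# Table 1, keyed by car number (encoding assumed for this sketch).
TABLE = {
    1: {"F": "high", "Q": "fair", "P": "med.", "M": "poor"},
    2: {"F": "v. high", "Q": "good", "P": "med.", "M": "poor"},
    3: {"F": "high", "Q": "good", "P": "low", "M": "poor"},
    4: {"F": "med.", "Q": "fair", "P": "med.", "M": "good"},
    5: {"F": "v. high", "Q": "fair", "P": "low", "M": "poor"},
    6: {"F": "high", "Q": "good", "P": "low", "M": "good"},
}


def partition(table, attributes):
    """Blocks of U/B: objects with equal values on every attribute in B."""
    blocks = {}
    for x, row in table.items():
        key = tuple(row[a] for a in attributes)
        blocks.setdefault(key, set()).add(x)
    return list(blocks.values())


blocks_fqp = partition(TABLE, ("F", "Q", "P"))
blocks_f = partition(TABLE, ("F",))
print(blocks_fqp)  # [{1}, {2}, {3, 6}, {4}, {5}]
print(blocks_f)    # [{1, 3, 6}, {2, 5}, {4}]
```

With all three condition attributes, only cars 3 and 6 are indiscernible; with F alone the granules become coarser.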

The indiscernibility relation will be used next to define basic concepts of 
rough set theory. Let us define now the following two operations on sets 

\[ B_*(X) = \{x \in U : B(x) \subseteq X\}, \]

\[ B^*(X) = \{x \in U : B(x) \cap X \neq \emptyset\}, \]

assigning to every subset X of the universe U two sets B_*(X) and B^*(X), called
the B-lower and the B-upper approximation of X, respectively. The set

\[ BN_B(X) = B^*(X) - B_*(X) \]

will be referred to as the B-boundary region of X. If the boundary region of X
is the empty set, i.e., BN_B(X) = ∅, then the set X is crisp (exact) with respect
to B; in the opposite case, i.e., if BN_B(X) ≠ ∅, the set X is referred to as rough
(inexact) with respect to B.

For example, the lower approximation of the set {1, 2, 3, 5} of poorly selling
cars is the set {1, 2, 5}, whereas the upper approximation of poorly selling cars is
the set {1, 2, 3, 5, 6}. The boundary region is the set {3, 6}. That means that cars
1, 2 and 5 can be surely classified, in terms of their features, as poorly selling
cars, while cars 3 and 6 cannot be characterized, by means of available data, as
selling poorly or not. Rough sets can also be defined using a rough membership
function, defined as

\[ \mu_X(x) = \frac{|X \cap B(x)|}{|B(x)|}, \qquad \mu_X(x) \in [0, 1]. \]

The value of the membership function μ_X(x) is a kind of conditional probability, and
can be interpreted as a degree of certainty to which x belongs to X (or 1 − μ_X(x)
as a degree of uncertainty).

For example, car 1 belongs to the set {1, 2, 3, 5} of cars selling poorly with
conditional probability 1, whereas car 3 belongs to the set with conditional
probability 0.5.
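
The approximations and the rough membership function for the car example can be reproduced with the Python sketch below. It is a minimal illustration; the helper names block_of, lower, upper and membership and the encoding of Table 1 are assumptions, not part of the paper.

```python
from fractions import Fraction

# Table 1, keyed by car number (encoding assumed for this sketch).
TABLE = {
    1: {"F": "high", "Q": "fair", "P": "med.", "M": "poor"},
    2: {"F": "v. high", "Q": "good", "P": "med.", "M": "poor"},
    3: {"F": "high", "Q": "good", "P": "low", "M": "poor"},
    4: {"F": "med.", "Q": "fair", "P": "med.", "M": "good"},
    5: {"F": "v. high", "Q": "fair", "P": "low", "M": "poor"},
    6: {"F": "high", "Q": "good", "P": "low", "M": "good"},
}
B = ("F", "Q", "P")
POOR = {1, 2, 3, 5}  # the set of poorly selling cars


def block_of(table, attributes, x):
    """B(x): all objects indiscernible from x with respect to the attributes."""
    key = tuple(table[x][a] for a in attributes)
    return {y for y, row in table.items() if tuple(row[a] for a in attributes) == key}


def lower(table, attributes, target):
    """Lower approximation: objects whose block lies entirely inside the target."""
    return {x for x in table if block_of(table, attributes, x) <= target}


def upper(table, attributes, target):
    """Upper approximation: objects whose block intersects the target."""
    return {x for x in table if block_of(table, attributes, x) & target}


def membership(table, attributes, target, x):
    """Rough membership: share of the block of x that lies in the target."""
    block = block_of(table, attributes, x)
    return Fraction(len(block & target), len(block))


low, up = lower(TABLE, B, POOR), upper(TABLE, B, POOR)
print(low, up, up - low)  # {1, 2, 5} {1, 2, 3, 5, 6} {3, 6}
print(membership(TABLE, B, POOR, 1), membership(TABLE, B, POOR, 3))  # 1 1/2
```

The boundary region is non-empty, so the set of poorly selling cars is rough with respect to the three condition attributes.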



5 Dependency of Attributes 

Our main problem can be rephrased as whether there is a functional dependency 
between the attribute M and attributes F, Q and P. In other words we are asking 
whether the value of the decision attribute is determined uniquely by the values 
of the condition attributes. It is easily seen that this is not the case for the 
example since cars 3 and 6 have the same values of condition attributes but 
different values of the decision attribute. The consistency factor γ(C, D) can also be
interpreted as a degree of dependency between C and D. We will say that D
depends on C in a degree k (0 ≤ k ≤ 1), denoted C ⇒_k D, if k = γ(C, D).
If k = 1 we say that D depends totally on C, and if k < 1, we say that D
depends partially (in a degree k) on C.

For example, for the dependency {F, P, Q} ⇒ {M} we get k = 4/6 = 2/3.
Dependency of attributes can also be defined using approximations, as shown
below.

We will say that D depends on C in a degree k (0 ≤ k ≤ 1), denoted C ⇒_k D,
if

\[ k = \gamma(C, D) = \frac{|POS_C(D)|}{|U|}, \]

where

\[ POS_C(D) = \bigcup_{X \in U/D} C_*(X), \]

called the positive region of the partition U/D with respect to C, is the set of all
elements of U that can be uniquely classified to blocks of the partition U/D by
means of C. Obviously

\[ \gamma(C, D) = \sum_{X \in U/D} \frac{|C_*(X)|}{|U|}. \]

If k = 1 we say that D depends totally on C, and if k < 1, we say that D depends
partially (in a degree k) on C.

The coefficient k expresses the ratio of all elements of the universe which
can be properly classified to blocks of the partition U/D, employing attributes
C.
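
The degree of dependency can be computed from the positive region as in the Python sketch below. The names partition, positive_region and dependency, and the dictionary form of Table 1, are assumptions chosen for this example.

```python
from fractions import Fraction

# Table 1, keyed by car number (encoding assumed for this sketch).
TABLE = {
    1: {"F": "high", "Q": "fair", "P": "med.", "M": "poor"},
    2: {"F": "v. high", "Q": "good", "P": "med.", "M": "poor"},
    3: {"F": "high", "Q": "good", "P": "low", "M": "poor"},
    4: {"F": "med.", "Q": "fair", "P": "med.", "M": "good"},
    5: {"F": "v. high", "Q": "fair", "P": "low", "M": "poor"},
    6: {"F": "high", "Q": "good", "P": "low", "M": "good"},
}


def partition(table, attributes):
    """Blocks of objects with equal values on the given attributes."""
    blocks = {}
    for x, row in table.items():
        blocks.setdefault(tuple(row[a] for a in attributes), set()).add(x)
    return list(blocks.values())


def positive_region(table, conditions, decisions):
    """Union of the C-lower approximations of all blocks of U/D."""
    positive = set()
    for decision_block in partition(table, decisions):
        for condition_block in partition(table, conditions):
            if condition_block <= decision_block:
                positive |= condition_block
    return positive


def dependency(table, conditions, decisions):
    """Degree k in which the decisions depend on the conditions."""
    return Fraction(len(positive_region(table, conditions, decisions)), len(table))


pos = positive_region(TABLE, ("F", "Q", "P"), ("M",))
k = dependency(TABLE, ("F", "Q", "P"), ("M",))
print(pos, k)  # {1, 2, 4, 5} 2/3
```

Cars 3 and 6 fall outside the positive region, which gives k = 4/6 and agrees with the consistency factor of Table 1.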

6 Reduction of Attributes 

We often face a question whether we can remove some data from a data table pre- 
serving its basic properties, that is - whether a table contains some superfluous 
data. This can be formulated as follows. 

Let C, D ⊆ A be sets of condition and decision attributes, respectively. We
will say that C' ⊆ C is a D-reduct (reduct with respect to D) of C, if C' is a
minimal subset of C such that

\[ \gamma(C, D) = \gamma(C', D). \]

Thus a reduct enables us to make decisions employing a minimal number of
conditions.

For example, for Table 1 we have two reducts, {F, Q} and {F, P}. It means that
instead of Table 1 we can use either Table 2 or Table 3, shown below.

Table 2. Reduced information system

Car   F         Q      M
1     high      fair   poor
2     v. high   good   poor
3     high      good   poor
4     med.      fair   good
5     v. high   fair   poor
6     high      good   good

Table 3. Another reduced information system

Car   F         P      M
1     high      med.   poor
2     v. high   med.   poor
3     high      low    poor
4     med.      med.   good
5     v. high   low    poor
6     high      low    good

These simplifications yield the following sets of decision rules. For Table
2 we get

1) if (F, high) and (Q, fair) then (M, poor),
2) if (F, v. high) and (Q, good) then (M, poor),
3) if (F, high) and (Q, good) then (M, poor),
4) if (F, med.) and (Q, fair) then (M, good),
5) if (F, v. high) and (Q, fair) then (M, poor),
6) if (F, high) and (Q, good) then (M, good),

and for Table 3 we have

7) if (F, high) and (P, med.) then (M, poor),
8) if (F, v. high) and (P, med.) then (M, poor),
9) if (F, high) and (P, low) then (M, poor),
10) if (F, med.) and (P, med.) then (M, good),
11) if (F, v. high) and (P, low) then (M, poor),
12) if (F, high) and (P, low) then (M, good).

Hence, employing the notion of the reduct we can simplify the set of decision 
rules. 
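
For a table as small as Table 1, the reducts can be found by brute force. The Python sketch below tries subsets of the condition attributes in order of increasing size and keeps the minimal ones that preserve the consistency factor. It is only an illustration under these assumptions (exhaustive search is exponential in the number of attributes), and the names gamma and reducts are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Table 1, keyed by car number (encoding assumed for this sketch).
TABLE = {
    1: {"F": "high", "Q": "fair", "P": "med.", "M": "poor"},
    2: {"F": "v. high", "Q": "good", "P": "med.", "M": "poor"},
    3: {"F": "high", "Q": "good", "P": "low", "M": "poor"},
    4: {"F": "med.", "Q": "fair", "P": "med.", "M": "good"},
    5: {"F": "v. high", "Q": "fair", "P": "low", "M": "poor"},
    6: {"F": "high", "Q": "good", "P": "low", "M": "good"},
}


def gamma(table, conditions, decision):
    """Consistency factor: share of rows whose conditions fix the decision."""
    seen = {}
    for row in table.values():
        seen.setdefault(tuple(row[a] for a in conditions), set()).add(row[decision])
    consistent = sum(
        1 for row in table.values()
        if len(seen[tuple(row[a] for a in conditions)]) == 1
    )
    return Fraction(consistent, len(table))


def reducts(table, conditions, decision):
    """All minimal subsets of the conditions preserving the consistency factor."""
    target = gamma(table, conditions, decision)
    found = []
    for size in range(len(conditions) + 1):
        for subset in combinations(conditions, size):
            # Skip supersets of reducts already found: they are not minimal.
            if any(set(r) <= set(subset) for r in found):
                continue
            if gamma(table, subset, decision) == target:
                found.append(subset)
    return found


result = reducts(TABLE, ("F", "Q", "P"), "M")
print(result)  # [('F', 'Q'), ('F', 'P')]
```

The search confirms the two reducts used for Tables 2 and 3; the pair {Q, P} is rejected because it lowers the consistency factor to 2/6.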

7 Conclusions 

Using rough sets to reason about data hinges on three basic concepts of rough set 
theory: approximations, decision rules and dependencies. All these three notions 
are strictly connected and are used to express our imprecise knowledge about 
reality, represented by data obtained from measurements, observations or from
a knowledgeable expert.




Reasoning about Data - A Rough Set Perspective 






The rough set approach to reasoning about data bridges to some extent 
the deductive and inductive way of reasoning. Decision rules in this approach 
can be understood as implications, whose degree of truth is expressed by the 
certainty factor. Consequently, this leads to generalization of the modus ponens 
inference rule, which in the rough set framework has a probabilistic flavor. It is 
interesting that the certainty factor of a decision rule is closely related to the 
rough membership function and to rough inclusion of sets, a basic concept of rough
mereology.

8 Acknowledgments 

Thanks are due to Prof. Andrzej Skowron and Dr. Marzena Kryszkiewicz for 
critical remarks. 



References 

1. Adams, E. W.: The Logic of Conditionals, An Application of Probability to
Deductive Logic. D. Reidel Publishing Company, Dordrecht, Boston (1975)

2. Bandler, W., Kohout, L.: Fuzzy power sets and fuzzy implication operators. Fuzzy
Sets and Systems 4 (1980) 183-190

3. Banerjee, M., Chakraborty, M.K.: Rough logics: A survey with further directions.
In: E. Orłowska (ed.): Incomplete information: Rough set analysis.
Physica-Verlag, Heidelberg (1997) 579-600

4. Borkowski, L. (ed.): Jan Łukasiewicz - Selected Works. North Holland Publishing
Company, Amsterdam, London, Polish Scientific Publishers, Warszawa (1970)

5. Dempster, A. P.: Upper and lower probabilities induced by the multivalued
mapping. Ann. Math. Statistics 38 (1967) 325-339

6. Demri, S., Orłowska, E.: Logical analysis of indiscernibility. Institute of Computer
Science, Warsaw University of Technology, ICS Research Report 11/96 (1996);
see also: E. Orłowska (ed.): Incomplete information: Rough set analysis.
Physica-Verlag, Heidelberg (1997) 347-380

7. Gabbay, D., Guenthner, F.: Handbook of Philosophical Logic Vol. 1, Elements of
Classical Logic, Kluwer Academic Publishers, Dordrecht, Boston, London (1983)

8. Łukasiewicz, J.: Die logischen Grundlagen der Wahrscheinlichkeitsrechnung.
Kraków (1913)

9. Magrez, P., Smets, P.: Fuzzy modus ponens: A new model suitable for applications
in knowledge-based systems. International Journal of Intelligent Systems 4 (1975)
181-200

10. Orłowska, E.: Modal logics in the theory of information systems. Zeitschrift für
Mathematische Logik und Grundlagen der Mathematik 30 (1984) 213-222

11. Pagliani, P.: Rough set theory and logic-algebraic structures. In: E. Orłowska
(ed.): Incomplete information: Rough set analysis. Physica-Verlag, Heidelberg
(1997) 109-190

12. Pawlak, Z.: Rough probability. Bull. Polish Acad. Sci. Tech. 33 (9-10) (1985)
499-504

13. Pawlak, Z.: Rough set theory and its application to data analysis. Systems and
Cybernetics (to appear)







Z. Pawlak 



14. Pawlak, Z.: Granularity of knowledge, indiscernibility and rough sets. IEEE
Conference on Evolutionary Computation (1998) 100-103

15. Pawlak, Z.: Rough Modus Ponens. IPMU'98 Conference, Paris (1998)

16. Polkowski, L., Skowron, A.: Rough Mereology. Proc. of the Symposium on
Methodologies for Intelligent Systems 869 (1994) 85-94, Charlotte, N.C.,
Lecture Notes in Artificial Intelligence, Springer Verlag

17. Polkowski, L., Skowron, A.: Rough Mereology: A New Paradigm for Approximate
Reasoning. Journ. of Approximate Reasoning 15 (4) (1996) 333-365

18. Rasiowa, H., Marek, W.: Approximating sets with equivalence relations. Theoret.
Comput. Sci. 48 (1986) 145-152

19. Rasiowa, H., Skowron, A.: Rough concepts logic. In: A. Skowron (ed.), Computation
Theory, Lecture Notes in Computer Science 208 (1985) 288-297

20. Rauszer, C.: A logic for indiscernibility relations. In: Proceedings of the
Conference on Information Sciences and Systems, Princeton University (1986) 834-837

21. Skowron, A.: Management of uncertainty in AI: A rough set approach. In: V.
Alagar, S. Belgrer and F.Q. Dong (eds.) Proc. SOFTEKS Workshop on
Incompleteness and Uncertainty in Information Systems, Springer Verlag and British
Computer Society (1994) 69-86

22. Trillas, E., Valverde, L.: On implication and indistinguishability in the setting of
fuzzy logic. Management Decision Support Systems Using Fuzzy Sets and
Possibility Theory, Verlag TU (1985) 198-212

23. Vakarelov, D.: A modal logic for similarity relations in Pawlak knowledge
representation systems. Fundamenta Informaticae 15 (1991) 61-79

24. Wong, S.K.M., Ziarko, W.: On learning and evaluation of decision rules in the
context of rough sets. Proceedings of the International Symposium on Methodologies
for Intelligent Systems (1986) 308-224

25. Zadeh, L.: Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems
1 (1977) 3-28

26. Zadeh, L.: The role of fuzzy logic in the management of uncertainty in expert
systems. Fuzzy Sets and Systems 11 (1983) 199-277

27. Ziarko, W., Shan, N.: KDD-R: A comprehensive system for knowledge discovery
using rough sets. Proceedings of the International Workshop on Rough Sets and
Soft Computing (RSSC'94) 164-173, San Jose (1994); see also: T. Y. Lin and A.
M. Wildberger (eds.), Soft Computing, Simulation Councils, Inc. (1995) 298-301




Information Granulation and its Centrality in 
Human and Machine Intelligence 



Lotfi A. Zadeh 

Professor in the Graduate School and Director 
Berkeley Initiative in Soft Computing (BISC). 

Computer Science Division and the Electronics Research Laboratory 
Department of EECS, University of California, Berkeley, CA 94720-1776 
Telephone: 510-642-4959; Fax: 510-642-1712 
E-mail: zadeh@cs.berkeley.edu 



Abstract 

In our quest for machines which are capable of performing non-trivial human 
tasks, we are developing a better understanding of the centrality of information 
granulation in human cognition, human reasoning and human decision-making. 
In many contexts, information granulation is a reflection of the finiteness of 
human ability to resolve detail and store information. In many other contexts, 
granulation is employed to solve a complex problem by partitioning it into sim- 
pler subproblems. This is the essence of the strategy of divide and conquer. 
What is remarkable is that humans are capable of performing a wide variety of 
tasks without any measurements and any computations. A familiar example is 
the task of parking a car. For a human it is an easy task so long as the final 
position of the car is not specified precisely. In performing this and similar tasks, 
humans employ their ability to exploit the tolerance for imprecision to achieve 
tractability, robustness and low solution cost. What is important to recognize is 
that this essential ability is closely linked to the modality of granulation and, 
more particularly, to information granulation. 

In a very broad sense, granulation involves partitioning of a whole into parts.
In more specific terms, granulation involves partitioning a physical or mental 
object into a collection of granules, with a granule being a clump of objects 
(points) drawn together by indistinguishability, similarity, proximity or func- 
tionality. Granulation may be physical or mental; dense or sparse; and crisp or 
fuzzy, depending on whether the boundaries of granules are or are not sharply 
defined. 

Modes of information granulation (IG) in which granules are crisp play im- 
portant roles in a wide variety of methods, approaches and techniques. Among 
them are: interval analysis, quantization, chunking, rough set theory, diakoptics,
divide and conquer, Dempster-Shafer theory, machine learning from examples,
qualitative process theory, decision trees, semantic networks, analog-to-digital
conversion, constraint programming, image segmentation, cluster analysis and
many others.

Important though it is, crisp IG has a major blind spot. More specifically, 
it fails to reflect the fact that in much - perhaps most - of human reasoning 
and concept formation the granules are fuzzy rather than crisp. For example, 
the fuzzy granules of a human head are the nose, ears, forehead, hair, cheeks, 
etc. Each of the fuzzy granules is associated with a set of fuzzy attributes, e.g., 
in the case of hair, the fuzzy attributes are color, length, texture, etc. In turn, 
each of the fuzzy attributes is associated with a set of fuzzy values. For example, 
in the case of the fuzzy attribute Length(hair), the fuzzy values are long, short, 
not very long, etc. The fuzziness of granules, their attributes and their values is 
characteristic of the ways in which human concepts are formed, organized and 
manipulated. In effect, fuzzy information granulation (fuzzy IG) may be viewed 
as a human way of employing data compression for reasoning and, more partic- 
ularly, making rational decisions in an environment of imprecision, uncertainty 
and partial truth. 

In fuzzy logic, the machinery of fuzzy information granulation - based on
the concepts of a linguistic variable, fuzzy if-then rule and fuzzy graph - has long
played a key role in most of its applications. However, what is emerging now is
a much more general theory of information granulation which goes considerably
beyond its place in fuzzy logic. This more general theory leads to two linked
methodologies - granular computing (GrC) and computing with words (CW).

In CW, words play the role of labels of granules and the initial and terminal
datasets are assumed to consist of propositions expressed in a natural
language. The input interface serves to translate from a natural language (NL)
to a generalized constraint language (GCL), while the output interface serves
to re-translate from GCL to NL. Internally, granular computing is employed to
propagate constraints from premises to conclusions.

The importance of the methodologies of granular computing and comput- 
ing with words derives from the fact that they make it possible to conceive 
and design systems which achieve high MIQ (Machine Intelligence Quotient) by 
mimicking the remarkable human ability to perform complex tasks without any 
measurements and any computations. 

Although GrC and CW are intended to deal with imprecision, uncertainty
and partial truth, both are well-defined theories built on a mathematical foun- 
dation. In coming years, they are likely to play an increasingly important role 
in the conception, design, construction and utilization of information/intelligent 
systems. 
Abstract. A typical real-life data set is affected by inconsistencies:
cases characterized by the same attribute values are classified as members
of different concepts. The most apparent methodology to handle
inconsistencies is offered by rough set theory. For every concept two sets
are computed: the lower approximation and the upper approximation.
From these two sets a rule induction system induces two rule sets: certain
and possible.

The problem is how to use these two sets in the process of classification of 
new, unseen cases. For example, should we use only certain rules (or only
possible rules) for classification? Should certain rules be used first and, 
when a case does not match any certain rule, should possible rules be 
used later? How to combine certain and possible rules with complete and 
partial matching of rules by a case? This paper presents experiments that 
were done to answer these questions. Different strategies were compared 
by classifying ten real-life data sets, using the error rate as a criterion of 
quality. 



1 Introduction 

The main idea of knowledge discovery is to look for regularities in the raw data 
describing some real-life phenomena. Such regularities are often presented in the 
form of if-then rules [8]. For example, a data set describes patients diagnosed by 
a clinician on the basis of many attributes (tests done in a laboratory, questions 
asked by a physician, etc.). The resulting rule set is put into an expert system 
and is used for diagnosis of new patients. 

In experiments presented in this paper a rule induction system called LERS
(Learning from Examples based on Rough Sets) was used. LERS was studied 
in many papers, see, e.g., [3]. Other rule induction systems, based on rough set 
theory, were described in [11], [12], [14], and [15]. Usually, training data sets, 
i.e., data sets used for rule induction, are inconsistent. For example, two
patients described by the same values of all attributes are diagnosed as members
of two different concepts (e.g., one may be sick and the other may be healthy).
LERS uses an approach to inconsistent data sets based on rough sets [9] and 
[10]. First, LERS checks training data for consistency. If data are inconsistent, 
for every concept two sets are computed: lower approximation and upper ap- 
proximation [2] and [3]. Then rule sets are induced separately from both sets. In 
our experiments, option LEM2 (Learning from Examples, version 2) was chosen 
to induce rule sets. Rules induced from lower approximations are called certain,
while rules induced from upper approximations are called possible. The termi- 
nology, introduced in [2], is based on the following observation: if a case is a 
member of the lower approximation of the concept, it is certainly a member of 
the concept. Similarly, if a case is a member of the upper approximation, it is 
only possibly a member of the concept. 
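
The distinction between the two approximations can be illustrated with a minimal Python sketch. It is not the LERS implementation: the function name `approximations`, the representation of a case as a pair (attribute values, decision), and the toy patient data are assumptions made only for illustration.

```python
from collections import defaultdict

def approximations(cases, concept):
    """Return (lower, upper) approximations of a concept as sets of case indices.

    cases   -- list of (attribute_values, decision) pairs (hypothetical format)
    concept -- the decision value that defines the concept
    """
    # Group cases that are indiscernible, i.e. share all attribute values.
    blocks = defaultdict(set)
    for index, (attributes, _) in enumerate(cases):
        blocks[attributes].add(index)

    members = {i for i, (_, decision) in enumerate(cases) if decision == concept}

    lower, upper = set(), set()
    for block in blocks.values():
        if block <= members:      # whole block lies inside the concept
            lower |= block
        if block & members:       # block overlaps the concept
            upper |= block
    return lower, upper

# Toy inconsistent data: cases 0 and 1 agree on all attributes but not on the decision.
cases = [
    (("high", "yes"), "sick"),
    (("high", "yes"), "healthy"),
    (("normal", "no"), "healthy"),
    (("high", "no"), "sick"),
]
lower, upper = approximations(cases, "sick")
print(lower, upper)
```

In this toy table, case 3 is certainly a member of the concept "sick", while cases 0 and 1 are only possibly members, because they cannot be told apart by their attribute values.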

The question is how to use both rule sets for classification of new cases,
members of a testing data set. For example, how should these two rule sets be
used for classification of new patients, of whom the rule induction system was
not aware? Besides, the classification system of LERS has four parameters that
may be set up by the user. Thus the user may select many strategies for
classification. Our objective was to determine the ranking of these strategies.

The standard process of LERS classification system may use four parameters: 
strength, specificity, matching factor and support. In our experiments, classifi- 
cation was performed using four options for choosing rule sets, two options for 
using specificity, and two options for using matching. Based on these different 
combinations, sixteen classification strategies were developed. A set of experi- 
ments was conducted: the sixteen strategies were tested on ten real-world data 
sets. The performance of the sixteen strategies was measured by the error rate
for each strategy and each data set.

2 LERS classification scheme 

The process of classification used in LERS has four factors: Strength, Specificity,
Matching_factor, and Support. The original approach was introduced under the
name of bucket brigade algorithm, see [1] and [7]. In this approach, the
classification of a case is based on three factors: strength, specificity, and support.
The additional factor, used for partial matching, was added to LERS [4]. In the
bucket brigade algorithm partial matching is not used at all. These four factors
are defined as follows:

Strength is a measure of how well the rule performed during training. It is 
the number of cases correctly classified by the rule in training data. The bigger 
strength is, the better. Strength was used in all sixteen strategies, following a 
recommendation from [5]. 

Specificity is a measure of completeness of a rule. It is the number of
conditions (attribute-value pairs) of a rule. That is, a rule with a bigger number
of attribute-value pairs is more specific. Specificity may or may not be used to
classify cases. In our experiments, both options were used: using specificity and
not using specificity.
For a specific case, if complete matching, where all attribute-value pairs of
at least one rule match all attribute-value pairs of a case, is impossible, LERS
tries partial matching. During partial matching all rules with at least one match
between the attribute-value pairs of a rule and the attribute-value pairs of a
case are identified. Matching_factor is a measure of matching of a case and a
rule. Matching_factor is defined as the ratio of the number of matched
attribute-value pairs of a rule with a case to the total number of attribute-value pairs of
the rule. In our experiments, two options of matching were used: using complete
matching first, then using partial matching if necessary, and using both complete
matching and partial matching at the same time. Thus, in partial matching,
Matching_factor was always used.

Support is related to a concept C. It is the sum of scores of all matching rules 
from C. Support is defined as follows: 



$$\sum_{\substack{\text{partially matching} \\ \text{rules } R \text{ describing } C}} \mathit{Strength}(R) \cdot \mathit{Specificity}(R) \cdot \mathit{Matching\_factor}(R)$$



The concept with the largest score wins the contest and the case is classified
as belonging to this concept. If there is a tie among concepts, the strongest rule
determines a concept. Support was used for classifying in all sixteen strategies,
again following advice from [5]. In the above formula any factor may be equal
to one, for example we may set specificity as equal to one. We say then that the
corresponding classification strategy does not use specificity. During complete
matching the value of Matching_factor is always equal to one. Obviously, during
complete matching, in the above formula, partially matching rules R describing
C should be interpreted as completely matching rules R describing C.
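
The scoring scheme described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the LERS code: the `Rule` class, the function names, the dictionary representation of cases, and the example rules are all hypothetical. The sketch follows the option of using complete matching first and partial matching only if necessary.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    conditions: dict   # attribute -> value pairs; assumed to be non-empty
    concept: str       # concept indicated by the rule
    strength: int      # number of training cases correctly classified by the rule

def matching_factor(rule, case):
    """Ratio of matched attribute-value pairs of the rule to all pairs of the rule."""
    matched = sum(1 for a, v in rule.conditions.items() if case.get(a) == v)
    return matched / len(rule.conditions)

def classify(rules, case, use_specificity=True):
    """Return the concept with the largest support, or None if no rule matches."""
    # Complete matching first; partial matching only if it fails.
    complete = [r for r in rules if matching_factor(r, case) == 1.0]
    candidates = complete or [r for r in rules if matching_factor(r, case) > 0.0]
    if not candidates:
        return None
    support = {}
    for r in candidates:
        specificity = len(r.conditions) if use_specificity else 1
        score = r.strength * specificity * matching_factor(r, case)
        support[r.concept] = support.get(r.concept, 0.0) + score
    best = max(support.values())
    tied = [c for c, s in support.items() if s == best]
    if len(tied) == 1:
        return tied[0]
    # Tie among concepts: the strongest rule determines the concept.
    strongest = max((r for r in candidates if r.concept in tied),
                    key=lambda r: r.strength)
    return strongest.concept

# Hypothetical rule set, for illustration only.
rules = [
    Rule({"temperature": "high", "cough": "yes"}, "sick", 4),
    Rule({"temperature": "normal"}, "healthy", 6),
    Rule({"cough": "yes"}, "sick", 2),
]
print(classify(rules, {"temperature": "high", "cough": "yes"}))  # complete matching
print(classify(rules, {"temperature": "high", "cough": "no"}))   # partial matching
```

Passing `use_specificity=False` sets the specificity factor to one, which corresponds to a strategy that does not use specificity.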

3 Sixteen classification strategies 

In our experiments, we used four combinations for using rule sets: using only 
certain rules, using only possible rules, using certain rules first, then possible 
rules if necessary, and using both certain and possible rules. Since we used four 
options for choosing rule sets, two options for specificity, and two options for 
matching, sixteen different strategies were tested in our experiments: 

1. Using only certain rules, specificity, complete matching, then partial match- 
ing if necessary, 

2. Using only certain rules, specificity and both complete matching and partial 
matching, 

3. Using only certain rules and complete matching, then partial matching if 
necessary, not using specificity, 

4. Using only certain rules and both complete matching and partial matching, 
not using specificity, 

5. Using only possible rules, specificity and complete matching, then partial 
matching if necessary,

6. Using only possible rules, specificity, both complete matching and partial
matching,

7. Using only possible rules, complete matching, then partial matching if
necessary, not using specificity,

8. Using only possible rules and both complete matching and partial matching, 
not using specificity, 

9. Using certain rules first, then using possible rules if necessary, specificity, 
and complete matching, then partial matching if necessary, 

10. Using certain rules first, then using possible rules if necessary, specificity and 
both complete matching and partial matching, 

11. Using certain rules first, then using possible rules if necessary and complete 
matching, then partial matching if necessary, not using specificity, 

12. Using certain rules first, then using possible rules if necessary and both 
complete matching and partial matching, not using specificity, 

13. Using both certain rules and possible rules, specificity, complete matching, 
then partial matching if necessary, 

14. Using both certain rules and possible rules, specificity, and both complete 
matching and partial matching, 

15. Using both certain rules and possible rules, complete matching, then partial 
matching if necessary, not using specificity, 

16. Using both certain rules and possible rules and both complete matching and 
partial matching, not using specificity. 
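
The numbering of the sixteen strategies above follows a regular pattern, which the following Python sketch reproduces. The constant names and the tuple layout are assumptions chosen for illustration.

```python
from itertools import product

RULE_SETS = [
    "only certain rules",
    "only possible rules",
    "certain rules first, then possible rules if necessary",
    "both certain and possible rules",
]
USE_SPECIFICITY = [True, False]
MATCHING = [
    "complete matching, then partial matching if necessary",
    "both complete and partial matching",
]

# Strategy number -> (rule set option, specificity used?, matching option).
# The product order matches the list above: the rule set option changes slowest,
# the matching option changes fastest.
strategies = dict(enumerate(product(RULE_SETS, USE_SPECIFICITY, MATCHING), start=1))

for number, (rule_set, specificity, matching) in strategies.items():
    spec = "specificity" if specificity else "no specificity"
    print(f"{number:2d}. {rule_set}; {spec}; {matching}")
```

Odd-numbered strategies use complete matching first, and even-numbered strategies use both kinds of matching at the same time.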

4 Experiments 

An overview of the ten real-life data sets is presented in Table 1. For each data
set 16 experiments were performed, using all 16 different classification strategies.
For each data set the performance of each classification strategy was measured in
terms of the error rate. The error rate was estimated using n-fold cross-validation
[13]. The training data and testing data were generated from an available data
set. During the process, each data set was first re-shuffled (the order of cases was
randomly changed) and then divided into n subsets of approximately equal
size. Each such data subset was used for testing exactly once. Each time, the
remaining n - 1 subsets were used as training data (to induce rules by LERS).
The average error rate for all n iterations was defined as the final error rate for
the data set and the classification strategy. For moderate and large sets (100
cases or more), n was chosen to be equal to 10; for small sets (fewer
than 100 cases), n was the total number of cases. The latter method is called
a leaving-one-out approach. The average error rates for the ten data sets are
presented in Tables 2 and 3.
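
The estimation procedure described above can be sketched in Python as follows. The function names, the fixed random seed, and the majority-class stand-in for rule induction are assumptions made for illustration; in the experiments the rules were induced by LERS.

```python
import random
from collections import Counter

def choose_n(num_cases):
    """10 folds for 100 or more cases, otherwise one fold per case (leaving-one-out)."""
    return 10 if num_cases >= 100 else num_cases

def folds(cases, seed=0):
    """Re-shuffle the cases and split them into n subsets of approximately equal size."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n = choose_n(len(shuffled))
    return [shuffled[i::n] for i in range(n)]

def error_rate(cases, induce, classify, seed=0):
    """Average error rate (in percent) over all n train/test iterations."""
    parts = folds(cases, seed)
    rates = []
    for i, test in enumerate(parts):
        train = [c for j, part in enumerate(parts) if j != i for c in part]
        model = induce(train)
        errors = sum(1 for attributes, decision in test
                     if classify(model, attributes) != decision)
        rates.append(errors / len(test))
    return 100.0 * sum(rates) / len(rates)

# Hypothetical stand-ins for rule induction and classification:
# always predict the most frequent decision of the training data.
def induce_majority(train):
    return Counter(decision for _, decision in train).most_common(1)[0][0]

def classify_majority(model, attributes):
    return model

data = [((i,), "b" if i % 4 == 0 else "a") for i in range(120)]
print(error_rate(data, induce_majority, classify_majority))
```

Each case is used for testing exactly once, so with folds of equal size the averaged rate equals the overall fraction of misclassified cases.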

The purpose of our experiments was to compare the strategies and find the 
best strategy. In order to compare the overall performance of all strategies, the 
Wilcoxon Signed Ranks Test [6], a nonparametric test for significant differences 
between paired observations, was used. All sixteen strategies were compared and 
ordered, from the best to the worst, see Figure 1. Note that some strategies are 
Data set         Number      Number           Number
                 of cases    of attributes    of concepts

Thesaurus        129,797      3                5
Wisconsin            625      9                9
Breast cancer        286      9                2
Nursing               90      8                3
HSV                  122     12                4
Primary tumor        339     17               21
Mammography         1284     12                2
Luktrain            1654     13                2
Bupa                 345      6                2
Iris                 150      4                3

Table 1. Data sets







Strategy   Thesaurus   Wisconsin   Breast   Nursing   HSV

 1         44.19       21.76       32.52    36.67     50.00
 2         44.69       17.92       29.72    28.89     34.43
 3         44.01       21.76       33.22    35.56     50.00
 4         44.51       17.92       29.72    28.89     33.61
 5         32.34       22.08       31.47    38.89     45.90
 6         44.29       17.92       29.72    28.89     33.61
 7         36.33       21.76       31.12    38.89     45.90
 8         44.53       17.92       29.72    28.89     33.61
 9         32.34       22.56       31.12    36.67     50.00
10         44.23       17.92       29.72    28.89     34.43
11         36.33       22.56       31.82    35.56     50.00
12         44.51       17.92       29.72    28.89     33.61
13         32.34       21.76       30.07    37.78     47.54
14         44.25       17.92       29.72    28.89     33.61
15         36.33       21.60       31.12    35.56     48.36
16         44.52       17.92       29.72    28.89     33.61

Table 2. Error rate
Strategy   Tumor   Mammography   Luktrain   Bupa    Iris

 1         61.95   31.54          8.04      48.70   26.00
 2         74.93   38.55         10.40      47.54   30.00
 3         61.65   32.40          7.92      52.75   26.00
 4         75.22   39.95         10.40      51.59   28.67
 5         59.88   23.13          7.62      38.26   11.33
 6         73.16   23.60         10.40      42.03   30.67
 7         59.00   22.43          7.38      41.45   26.67
 8         74.93   25.00         10.40      41.16   29.33
 9         57.52   23.52          8.22      38.55   11.33
10         74.93   38.55         10.40      47.54   30.00
11         57.52   22.74          8.10      41.74   26.67
12         75.22   39.95         10.40      51.59   28.67
13         56.34   23.05          8.16      38.55   11.33
14         75.22   24.07         10.40      41.74   34.00
15         55.46   22.12          7.74      41.45   26.67
16         75.22   27.80         10.40      40.87   29.33

Table 3. Error rate



presented by the same node of the diagram in Figure 1, e.g., strategies 5 and 9. 
This means that the corresponding strategies are not significantly different. On 
the other hand, some strategies, e.g., 2 and 3, are not connected by any arc in 
Figure 1. This means that these strategies are statistically incomparable. 
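
As an illustration of how two strategies can be compared by signed ranks, the following Python sketch computes the two rank sums for strategies 1 and 13, using the error rates reported in Tables 2 and 3. The function name is an assumption, and the sketch stops at the rank sums: the significance level and the exact decision procedure used in the experiments are not stated in the text, so no conclusion about significance is drawn here.

```python
def signed_rank_sums(first, second):
    """Return (W_plus, W_minus), the rank sums of positive and negative differences.

    Zero differences are dropped and tied absolute differences share the average rank.
    """
    diffs = [a - b for a, b in zip(first, second) if a != b]
    ordered = sorted(diffs, key=abs)
    w_plus = w_minus = 0.0
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average_rank = (i + 1 + j) / 2   # mean of ranks i+1, ..., j
        for d in ordered[i:j]:
            if d > 0:
                w_plus += average_rank
            else:
                w_minus += average_rank
        i = j
    return w_plus, w_minus

# Error rates in the order: Thesaurus, Wisconsin, Breast, Nursing, HSV,
# Tumor, Mammography, Luktrain, Bupa, Iris.
strategy_1 = [44.19, 21.76, 32.52, 36.67, 50.00, 61.95, 31.54, 8.04, 48.70, 26.00]
strategy_13 = [32.34, 21.76, 30.07, 37.78, 47.54, 56.34, 23.05, 8.16, 38.55, 11.33]

w_plus, w_minus = signed_rank_sums(strategy_1, strategy_13)
print(w_plus, w_minus)   # 42.0 3.0
```

Here a positive difference means that strategy 1 had the bigger error rate on that data set. The smaller of the two sums would then be compared with a critical value for the number of non-zero differences (nine in this case).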

5 Conclusions 

Our research focused on the classification process using rule sets induced by 
LERS. The following conclusions may be drawn from our experiments. 

Choosing different matching options has no significant impact on the
classification results. That is, there is no significant difference between using
complete matching first, then partial matching if necessary, and using both
complete and partial matching. As follows from the diagram in Figure 1, the
even-numbered strategies have about the same performance as the odd-numbered
strategies do. Thus, different options of matching were not important to the
classification process.

With the matching option of using complete matching first, then partial
matching if necessary (strategies 1, 3, 5, 7, 9, 11, 13 and 15), and the same rule
set option, there is no significant difference between using specificity and not
using specificity.

With the same matching option (complete matching first, then partial
matching if necessary), using both certain and possible rules yields the best results
(strategies 13 and 15), while using only certain rules yields the worst results
(strategies 1 and 3).

With the matching option of using both complete and partial matching, and
the same rule set option, there is no significant difference between using specificity
and not using specificity. The corresponding strategies are: 2 vs. 4, 6 vs. 8, 10
vs. 12, and 14 vs. 16.



[Figure 1. Diagram ordering the sixteen classification strategies, from smaller error rate to bigger error rate.]




Choosing between the certain rules and possible rules is an important factor 
for classification. The option with only certain rules (strategies 1, 2, and 3) 
results in a bigger error rate, and the option with using both certain rules and 
possible rules (strategies 13, 14, 15, and 16) yields a smaller error rate. 

Summarizing, there is no single best strategy among the 16 strategies used
for classification. Thus, the choice of a successful strategy depends on the kind
of input data as well.
Abstract. The prime ingredients of the operations of the
human cognitive mind are descriptions. Descriptions may be 
approximate in the sense that imprecision may not allow the 
construction of a set from a word description. In a previous pa- 
per this type of imprecision was introduced with the name of 
approximate sets. The operations on approximate sets defined 
there were as precise as possible, but they had some difficul- 
ties, one of them being that the union operation defined was 
not associative. It was foreseen that precision in the operations 
might be traded for operational convenience. In this paper such 
a possibility is investigated, less precise operations are offered, 
and their convenience is studied. 



1 Introduction 

The prime ingredients of the operations of the human cognitive mind are
descriptions. The objects whose descriptions the human mind deals with may be
considered from many points of view, and consequently may admit different
classifications. At this moment we wish to bear in mind a particular classification
according to which an object may be considered an individual object or a set
object. We admit that an individual object is described by the set of its features.
We pay attention now to the set objects. One possible way of describing a set
object is by means of the list of all of its components. Another way is by means
of sentences relating the features of the components without referring to any
individual object in particular.

This has been recognized for a long time, and used to establish two kinds of 
definitions for sets: (1) the definition by enumeration of its components, and (2) 
the definition of a set by a property. 

Descriptions may be approximate in the sense that imprecision does not
allow the construction of a set from a word description. An example will help
to clarify this discussion. Suppose a soccer team A, which has 25 registered
players $p_1, p_2, \ldots, p_{25}$, has to play an official tournament game, say in
Warsaw, and a friendly invited game in Cracow. The coach has decided that
players $p_1$ to $p_{16}$ will be available to play in Warsaw and players $p_8$ to
$p_{25}$ will travel to Cracow. It is known that a soccer team consists of 11 players
and that in a particular official game only three players may be substituted. This
means that at least 11 and at most 14 of players $p_1$ to $p_{16}$ will play in Warsaw.
It has been agreed that in Cracow four players may be substituted, thus in Cracow
from 11 to 15 of players $p_8$ to $p_{25}$ will play. This type of imprecise knowledge
may be described by a formal statement.

In a previous paper [2] this type of imprecision was introduced with the
name of approximate sets, and the way of handling these descriptions to compute
some possible inferences was studied. Operations of union and intersection were
defined. In the above example the union would determine the approximate set
describing the players that would play in either Warsaw or Cracow or both, while
the intersection would establish those who would play in both cities.

The operations on approximate sets defined in our previous paper were
as precise as possible, but they had some difficulties, to be discussed later in
more detail, one of them being that the union as defined was not associative. It
was foreseen that precision in the operations might be traded for operational
convenience. In this paper such a possibility is investigated, less precise operations
are offered, and their convenience is studied.

2 Approximate sets 

In this section we present succinctly the definition of approximate sets and
the main results that were presented in our previous paper [2].

Let $X = \{a, b, \ldots, n\}$ be a finite, extensionally defined set of cardinality $|X|$,
and let $x_1$, $x_2$ be two natural numbers. We define the approximate set $\mathbf{X}$ as a
three-tuple $(x_1, x_2)X$, where $0 \le x_1 \le x_2 \le |X|$. The meaning of
$\mathbf{X} = (x_1, x_2)X$ is that $\mathbf{X}$ contains between $x_1$ and $x_2$ elements of $X$.
We call $x_1$ the lower bound of the approximate set, and $x_2$ the upper bound;
$X$ is called the base set. An approximate set describes approximately a subset of
$X$. It is not known which elements constitute it, nor even its cardinality. If we
apply this definition to the example presented above, the players playing the
game in Warsaw could be defined by the approximate set
$(11, 14)\{p_1, \ldots, p_{16}\}$, while the players in the Cracow game could be
described by $(11, 15)\{p_8, \ldots, p_{25}\}$.

The usual mathematical quantifiers $\forall$ and $\exists$ define approximate sets. For
instance, given a universe $X$, $\forall x$ defines the approximate set $(|X|, |X|)X$, and
$\exists x$ defines $(1, |X|)X$. Quantifiers used in more sophisticated logic models, like
the counting quantifiers [1], also define approximate sets. The counting quantifier
$\exists^{\ge l}$, with the interpretation "There are at least $l$", defines the approximate
set $(l, |X|)X$.

Two approximate sets $\mathbf{X} = (x_1, x_2)X$, $\mathbf{Y} = (y_1, y_2)Y$ are said to be equal iff
$x_1 = y_1$, $x_2 = y_2$, $X = Y$. It must be noted that even though $\mathbf{X}$ and $\mathbf{Y}$ are
both subsets of the same set, $\mathbf{X} = \mathbf{Y}$ does not imply we are referring to the same
subset. Not only that, it is entirely possible that the two subsets are disjoint.

Approximate sets show a partial order. We shall say that given two
approximate sets $\mathbf{X} = (x_1, x_2)X$, $\mathbf{Y} = (y_1, y_2)Y$, $\mathbf{X}$ is included in $\mathbf{Y}$, and
represent it as $\mathbf{X} \sqsubseteq \mathbf{Y}$, iff $x_1 \le y_1$, $x_2 \le y_2$, $X \subseteq Y$. As before, the two
subsets defined by $\mathbf{X}$ and $\mathbf{Y}$ are not necessarily included one in the other.

There are two operations defined over approximate sets. Given the sets
$X$, $Y$, $X \cup Y$, $X \cap Y$, where $\cup$, $\cap$ represent the union of sets and
intersection of sets respectively, we define the union, $\mathbf{X} \sqcup \mathbf{Y}$, and
intersection, $\mathbf{X} \sqcap \mathbf{Y}$, of approximate sets $\mathbf{X} = (x_1, x_2)X$,
$\mathbf{Y} = (y_1, y_2)Y$ as

$$\mathbf{X} \sqcup \mathbf{Y} = (a, b)(X \cup Y)$$
$$\mathbf{X} \sqcap \mathbf{Y} = (c, d)(X \cap Y)$$

where

$$a = \max(x_1, y_1), \qquad b = \min(x_2 + y_2, |X \cup Y|)$$
$$c = \max(0, x_1 + y_1 - |X \cup Y|), \qquad d = \min(x_2, y_2, |X \cap Y|)$$

These operations are commutative, neither is idempotent, the intersection is
associative, and the union is not associative.

These operations were defined so as to have the best computational behavior,
that is, to lose the least precision possible when operating with approximate
sets. Unfortunately, as the last paragraph shows, the operational behavior, that
is, the algebraic characteristics of the operations, is very poor. In the next section
we present new operations that, although they may decrease the precision while
computing, do have much better algebraic characteristics.
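
The two operations of this section can be tried out with a small Python sketch. The class name `ApproxSet`, the function names, and the use of integers 1 to 25 for the players are assumptions made for illustration; the bounds follow the formulas for a, b, c and d given above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ApproxSet:
    lower: int        # lower bound x1
    upper: int        # upper bound x2
    base: frozenset   # base set X

    def __post_init__(self):
        if not 0 <= self.lower <= self.upper <= len(self.base):
            raise ValueError("bounds must satisfy 0 <= lower <= upper <= |base|")

def sq_union(x, y):
    """Square-union: (max(x1, y1), min(x2 + y2, |X u Y|)) over X u Y."""
    base = x.base | y.base
    return ApproxSet(max(x.lower, y.lower), min(x.upper + y.upper, len(base)), base)

def sq_intersection(x, y):
    """Square-intersection: (max(0, x1 + y1 - |X u Y|), min(x2, y2, |X n Y|)) over X n Y."""
    union = x.base | y.base
    common = x.base & y.base
    lower = max(0, x.lower + y.lower - len(union))
    upper = min(x.upper, y.upper, len(common))
    return ApproxSet(lower, upper, common)

# Soccer example: players are numbered 1..25.
warsaw = ApproxSet(11, 14, frozenset(range(1, 17)))   # players 1..16
cracow = ApproxSet(11, 15, frozenset(range(8, 26)))   # players 8..25
either = sq_union(warsaw, cracow)         # between 11 and 25 of players 1..25
both = sq_intersection(warsaw, cracow)    # between 0 and 9 of players 8..16

# A small case in which the square-union is not associative.
x = ApproxSet(0, 1, frozenset({1, 2, 3, 4}))
y = ApproxSet(0, 2, frozenset({1, 2}))
z = ApproxSet(0, 2, frozenset({1, 2}))
left = sq_union(sq_union(x, y), z)    # upper bound 4
right = sq_union(x, sq_union(y, z))   # upper bound 3
print(either.lower, either.upper, both.lower, both.upper, left.upper, right.upper)
```

The last two values differ, which gives one concrete instance of the lack of associativity mentioned above.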

3 New definition of operations 

In this section we will define the new union and intersection operations on ap- 
proximate sets. From now on the operations defined in Section 2 will be referred 
to as square-union and square-intersection, and the new operations to be defined 
below will be referred to simply as union and intersection. 

3.1 Union 

The union operation over two approximate sets $\mathbf{X} = (x_1, x_2)X$ and
$\mathbf{Y} = (y_1, y_2)Y$ is defined as

$$\mathbf{X} \vee \mathbf{Y} = \bigsqcup_{n=1}^{\infty} (\mathbf{X} \sqcup \mathbf{Y})^{(n)}$$

that is, first a square-union is done between $\mathbf{X}$ and $\mathbf{Y}$, and the result is
operated with itself through square-unions infinitely many times. Once
$\mathbf{X} \sqcup \mathbf{Y}$ is performed, on each operation of this square-union with itself, the
lower bound remains unchanged while the upper bound, if it is not already
$|X \cup Y|$, will strictly increase until reaching this value. Therefore, it is easy to
show that the final result of this operation is

$$\mathbf{X} \vee \mathbf{Y} = (\max(x_1, y_1), |X \cup Y|)(X \cup Y) \qquad (1)$$

Comparing this operation to the square-union, we see that there might be a
loss in precision, as the upper bound goes from $\min(x_2 + y_2, |X \cup Y|)$ to
$|X \cup Y|$. That is, $\mathbf{X} \sqcup \mathbf{Y} \sqsubseteq \mathbf{X} \vee \mathbf{Y}$.
3.2 Intersection

The intersection operation is defined analogously to the union as

$$\mathbf{X} \wedge \mathbf{Y} = \bigsqcap_{n=1}^{\infty} (\mathbf{X} \sqcap \mathbf{Y})^{(n)}$$

After the first square-intersection is performed, on each operation of the result
with itself, the upper bound will remain unchanged, while the lower bound will
strictly decrease, unless it is zero. Therefore the final result is

$$\mathbf{X} \wedge \mathbf{Y} = (0, \min(x_2, y_2, |X \cap Y|))(X \cap Y) \qquad (2)$$

Again there might be a loss of precision with respect to the square-intersection,
as the lower bound diminishes from $\max(0, x_1 + y_1 - |X \cup Y|)$ to 0. That is,

$$\mathbf{X} \sqcap \mathbf{Y} \sqsubseteq \mathbf{X} \wedge \mathbf{Y}.$$

As we shall see in Section 4, this loss of precision is compensated by a better 
behavior of these operations. 
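
The closed forms (1) and (2) can be checked on small examples by actually iterating the square operations until nothing changes. The following Python sketch does this; the tuple representation (lower bound, upper bound, base set) and the function names are assumptions made for illustration. Note that the iteration only reaches (1) when the first square-union has a positive upper bound, and only reaches (2) when the first square-intersection has a lower bound smaller than the size of its base set; the examples are chosen accordingly.

```python
def sq_union(x, y):
    """Square-union of approximate sets given as (lower, upper, base) tuples."""
    base = x[2] | y[2]
    return (max(x[0], y[0]), min(x[1] + y[1], len(base)), base)

def sq_intersection(x, y):
    """Square-intersection of approximate sets given as (lower, upper, base) tuples."""
    union, common = x[2] | y[2], x[2] & y[2]
    return (max(0, x[0] + y[0] - len(union)), min(x[1], y[1], len(common)), common)

def iterate(operation, start):
    """Operate a result with itself until it no longer changes."""
    current = start
    while True:
        following = operation(current, current)
        if following == current:
            return current
        current = following

def union(x, y):
    return iterate(sq_union, sq_union(x, y))

def intersection(x, y):
    return iterate(sq_intersection, sq_intersection(x, y))

# Union: the upper bound grows 5 -> 10 -> 20 while the lower bound stays at 2.
x = (2, 3, frozenset(range(1, 11)))    # base 1..10
y = (1, 2, frozenset(range(5, 21)))    # base 5..20
print(union(x, y)[:2])                 # (2, 20), as given by (1)

# Intersection: the lower bound shrinks 5 -> 2 -> 0 while the upper bound stays at 8.
p = (8, 9, frozenset(range(1, 11)))    # base 1..10
q = (9, 10, frozenset(range(3, 13)))   # base 3..12
print(intersection(p, q)[:2])          # (0, 8), as given by (2)
```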

3.3 Complement of an approximate set 

In our previous paper in which we introduced approximate sets there was no 
definition of the complement of an approximate set. We proceed to introduce 
the definition here. 

An approximate set is an imprecise description of a concept, so the
complement of an approximate set should be an imprecise description of the complement
of the concept. If, given a set $S$ of 30 students, we know that between 2 and 12
are bright, we can express this concept by $\mathbf{B} = (2, 12)S$. Suppose we want to
know how many students are not bright. It is easy to see that this concept is
described as $\mathbf{NB} = (18, 28)S$. From here we establish a definition of complement
of an approximate set.

Definition 1. Given an approximate set $\mathbf{X} = (x_1, x_2)X$, we define the
complement of this set, $\overline{\mathbf{X}}$, as

$$\overline{\mathbf{X}} = (|X| - x_2, |X| - x_1)X$$

It must be stressed that $\mathbf{X}$ and $\overline{\mathbf{X}}$ are both subsets of the same base set $X$.
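
The student example can be reproduced with a few lines of Python. The function name and the representation of an approximate set by its two bounds together with the size of the base set are assumptions made for illustration.

```python
def complement(lower, upper, base_size):
    """Bounds of the complement of the approximate set (lower, upper) over a base of base_size elements."""
    return (base_size - upper, base_size - lower)

bright = (2, 12)                          # between 2 and 12 of 30 students are bright
not_bright = complement(*bright, 30)      # (18, 28)
back_again = complement(*not_bright, 30)  # (2, 12)
print(not_bright, back_again)
```

Applying the complement twice returns the original bounds.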

4 Operation behavior 

Theorem 1. The union and intersection operations are commutative.

This property follows from the definition of the operations as shown in (1) and
(2).

Theorem 2. The intersection operation is associative.

This result follows directly from the associativity of the square-intersection. □

Theorem 3. The union operation is associative.

Proof.

$$\begin{aligned}
(\mathbf{X} \vee \mathbf{Y}) \vee \mathbf{Z} &= (\max(x_1, y_1), |X \cup Y|)(X \cup Y) \vee \mathbf{Z} \\
&= (\max(\max(x_1, y_1), z_1), |(X \cup Y) \cup Z|)((X \cup Y) \cup Z) \\
&= (\max(x_1, y_1, z_1), |X \cup Y \cup Z|)(X \cup Y \cup Z) \\
&= (\max(x_1, \max(y_1, z_1)), |X \cup (Y \cup Z)|)(X \cup (Y \cup Z)) \\
&= \mathbf{X} \vee (\max(y_1, z_1), |Y \cup Z|)(Y \cup Z) \\
&= \mathbf{X} \vee (\mathbf{Y} \vee \mathbf{Z})
\end{aligned}$$

□

Theorem 4. The intersection is distributive over the union.

Proof.

$$\begin{aligned}
\mathbf{X} \wedge (\mathbf{Y} \vee \mathbf{Z}) &= \mathbf{X} \wedge (\max(y_1, z_1), |Y \cup Z|)(Y \cup Z) \\
&= (0, \min(x_2, |Y \cup Z|, |X \cap (Y \cup Z)|))(X \cap (Y \cup Z)) \\
&= (0, \min(x_2, |(X \cap Y) \cup (X \cap Z)|))((X \cap Y) \cup (X \cap Z))
\end{aligned}$$

Given that the upper bound of a union does not depend on the upper bounds of
its operands, we can now divide this into the union of two approximate sets with
the upper bounds set to $\min(x_2, y_2, |X \cap Y|)$ and $\min(x_2, z_2, |X \cap Z|)$
respectively, and the proof is almost complete:

$$\begin{aligned}
&= (0, \min(x_2, y_2, |X \cap Y|))(X \cap Y) \vee (0, \min(x_2, z_2, |X \cap Z|))(X \cap Z) \\
&= (\mathbf{X} \wedge \mathbf{Y}) \vee (\mathbf{X} \wedge \mathbf{Z})
\end{aligned}$$

□

In like fashion, it is easily seen that the union is not distributive over the
intersection.

Theorem 5. There is no zero element for the union operation.

Proof. Let $\mathbf{X} = (x_1, x_2)X$ be an approximate set, and $\mathbf{N} = (n_1, n_2)N$ be the
candidate zero element, that is, $\mathbf{X} \vee \mathbf{N} = \mathbf{X}$. From the definition of the union
operation, it must be that $X \cup N = X$, so $N \subseteq X$, but then the upper bound
of the union will be $|X|$, and this is in general different from $x_2$, so the result of
$\mathbf{X} \vee \mathbf{N}$ is in general different from $\mathbf{X}$. □

Theorem 6. There is no unit element for the intersection operation.

Proof. From the definition of the intersection operation, the lower bound must
be 0, and that is in general different from $x_1$, so there is no unit element. □

With the theorems proven above we establish that the set of all approximate
sets of a given universe, with the union and intersection, has the structure of
a semi-ring without a unit element. This is clearly an improvement over the
algebraic properties of the square-union and square-intersection, which presented
no definite structure.

Next we present some results showing the characteristics of the new operations
with respect to sets.

Theorem 7. $\mathbf{X} \wedge \mathbf{Y} \sqsubseteq \mathbf{X} \sqsubseteq \mathbf{X} \vee \mathbf{Y}$

Proof. We apply the definition of the operations and notice that

$$0 \le x_1 \le \max(x_1, y_1)$$
$$\min(x_2, y_2, |X \cap Y|) \le x_2 \le |X \cup Y|$$
$$(X \cap Y) \subseteq X \subseteq (X \cup Y)$$

and by the definition of inclusion of approximate sets given in Section 2, the
result is proven. □

Theorem 8. Neither operation is idempotent.

Proof. Applying the definition of the operations, same as above, we obtain

$$0 \le x_1 = \max(x_1, x_1)$$
$$\min(x_2, x_2, |X \cap X|) = x_2 \le |X \cup X|$$
$$(X \cap X) = X = (X \cup X).$$

Given the inequalities that remain, the proof is complete. □

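Although not idempotent, we could say that they are quasi-idempotent,
because $(\mathbf{X} \vee \mathbf{X}) \vee \mathbf{X} = (\mathbf{X} \vee \mathbf{X})$ and $(\mathbf{X} \wedge \mathbf{X}) \wedge \mathbf{X} = (\mathbf{X} \wedge \mathbf{X})$.

Theorem 9. $\overline{\overline{\mathbf{X}}} = \mathbf{X}$

Proof. As usual, $\mathbf{X} = (x_1, x_2)X$, so $\overline{\mathbf{X}} = (|X| - x_2, |X| - x_1)X$ and

$$\overline{\overline{\mathbf{X}}} = (|X| - (|X| - x_1), |X| - (|X| - x_2))X = (x_1, x_2)X$$

□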

When trying to establish the operational behavior of the complement we
quickly run into a problem due to the lack of a common framework: the
complement of $(x_1, x_2)X$ belongs to $X$, the complement of $(y_1, y_2)Y$ belongs to $Y$,
and therefore there is no way in general to relate one to the other. In set theory
this is solved through the Universe, that is, all sets are supposed to be subsets
of a common frame. If we establish a universe $U$, we can prove the following
theorem.
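Theorem 10. Within a common framework $U$ both de Morgan's Laws hold for
approximate sets with the $\vee$, $\wedge$ operations and the complement.

Proof. For the first de Morgan Law, let $\mathbf{X} = (x_1, x_2)U$ and $\mathbf{Y} = (y_1, y_2)U$.
On the one hand

$$\begin{aligned}
\overline{\mathbf{X} \vee \mathbf{Y}} &= \overline{(\max(x_1, y_1), |U|)U} \\
&= (0, |U| - \max(x_1, y_1))U
\end{aligned}$$

And on the other hand

$$\begin{aligned}
\overline{\mathbf{X}} \wedge \overline{\mathbf{Y}} &= (|U| - x_2, |U| - x_1)U \wedge (|U| - y_2, |U| - y_1)U \\
&= (0, |U| - \max(x_1, y_1))U
\end{aligned}$$

For the second de Morgan's Law we proceed in the same fashion.

$$\begin{aligned}
\overline{\mathbf{X} \wedge \mathbf{Y}} &= \overline{(0, \min(x_2, y_2, |U|))U} \\
&= (|U| - \min(x_2, y_2), |U|)U
\end{aligned}$$

And

$$\begin{aligned}
\overline{\mathbf{X}} \vee \overline{\mathbf{Y}} &= (|U| - x_2, |U| - x_1)U \vee (|U| - y_2, |U| - y_1)U \\
&= (\max(|U| - x_2, |U| - y_2), |U|)U \\
&= (|U| - \min(x_2, y_2), |U|)U
\end{aligned}$$

□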
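
Both laws can also be checked exhaustively for a small universe. The following Python sketch does so using the closed forms (1) and (2) and the definition of the complement, with every approximate set taken over the same universe of n elements; the function names and the representation of an approximate set by its pair of bounds are assumptions made for illustration. It is a sanity check, not a substitute for the proof.

```python
from itertools import product

def union(x, y, n):
    """Closed form (1) when both base sets equal a universe of n elements."""
    return (max(x[0], y[0]), n)

def intersection(x, y, n):
    """Closed form (2) when both base sets equal a universe of n elements."""
    return (0, min(x[1], y[1], n))

def complement(x, n):
    """Complement of (x1, x2) over a universe of n elements."""
    return (n - x[1], n - x[0])

def check_de_morgan(n):
    """Check both laws for all pairs of approximate sets over the universe.

    Returns the number of approximate sets examined.
    """
    sets = [(lo, hi) for lo in range(n + 1) for hi in range(lo, n + 1)]
    for x, y in product(sets, repeat=2):
        assert complement(union(x, y, n), n) == \
            intersection(complement(x, n), complement(y, n), n)
        assert complement(intersection(x, y, n), n) == \
            union(complement(x, n), complement(y, n), n)
    return len(sets)

print(check_de_morgan(6))   # 28 approximate sets, all pairs satisfy both laws
```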



5 Conclusion 

We have presented new union and intersection operations on approximate sets. 
These operations present better algebraic behavior than the original ones, al- 
though some loss of precision might occur. We have also introduced a comple- 
ment operation and proven that under certain circumstances de Morgan’s Laws 
hold for approximate sets. 

This work is just a first approach in trying to establish well-behaved
operations on approximate sets. Further research on exactly what loss of precision
must be incurred to obtain a good algebraic behavior is warranted.
Abstract. We discuss an uncertainty representation based on rough
membership functions in inconsistent decision tables. We propose a rea- 
soning model for dealing with objects having probability distributions 
instead of concrete values on attributes. We prove discernibility charac- 
teristics for minimal boolean implicants in inconsistent decision tables 
with indeterministically defined objects. 



1 Introduction 

The theory of rough sets ([4]) gave rise to many methods of reasoning about
data. One of them concerns the search for optimal sets of features for classification
of new cases. By adopting the principles of boolean reasoning, many algorithmic
approaches to the above task have been developed (see e.g. [3] for further
references), with the notion of consistent decision table as the starting point.
However, since inconsistency has been introduced by handling lower and upper
set approximations ([4]), rough sets based research has been providing more and
more strategies of dealing with inconsistency (see e.g. [9], [10]).

We begin by recalling the rough set approach to data analysis (see e.g. [2],
[7]). Then we discuss uncertainty representation in inconsistent decision tables
with respect to rough membership functions ([5]), which have a strong support
in probability theory and statistics (see [6] for further references). Most of the
definitions and results presented in Sections 2 and 3 can be found in rough
set literature (see e.g. [5], [8]), under the assumption that attributes remain
discrete, with values uniquely determined over considered objects. In Section 4 we
let objects be indeterministically defined, with probability distributions instead
of concrete values on attributes. Decision tables then become a basis for reasoning
about new cases that remain indeterministic as well. Such generalization requires
verification of fundamental notions and results. Here we focus especially on the
problem of finding a minimal frequential reduct in a decision table with uncertain
objects, providing discernibility characteristics analogous to those for the classical
model.

This work was supported by KBN Research Grant No. 8T11C01011 and ESPRIT

2 Rough sets based approach to data

While reasoning about a domain specified by our needs, we are usually forced to
rely only on information gathered by analyzing some sample of objects. The
main paradigm of rough set theory ([4]) states that such a universe of known
objects, stored within an information system, is assumed to be the only source
of knowledge that can be used for classifying cases outside the sample.

An information system is a tuple $\mathbb{A} = (U, A)$, where each attribute $a \in A$,
corresponding to some feature which may be important with respect to object
classification, is identified with a function $a : U \to V_a$ from the universe $U$ of
objects onto the set $V_a$ of all possible values on $a$.

While reasoning about new objects outside $U$ we refer to equivalence classes
of the indiscernibility relation, defined, for an arbitrary subset $B \subseteq A$, as

$IND(B) = \{(u_1, u_2) \in U \times U : Inf_B(u_1) = Inf_B(u_2)\}$ (1)

where the information function $Inf_B$, such that $Inf_B(u) = (a_{i_1}(u), \ldots, a_{i_{card(B)}}(u))$,
is consistent with a fixed linear ordering $A = \langle a_1, \ldots, a_{card(A)} \rangle$. One can see that
there is a one-to-one correspondence between equivalence classes of $IND(B)$ and
elements of $V_B^U \subseteq \times_{a \in B} V_a$, the set of all vector values occurring on $B$ in $\mathbb{A}$,
i.e. such that $w_B \in V_B^U$ iff there is at least one $u \in U$ satisfying $Inf_B(u) = w_B$.

In applications, reasoning is usually stated as a classification problem, concerning
a distinguished decision attribute to predict under given conditions. By a
decision table we understand a triple $\mathbb{A} = (U, A, d)$, where $d \notin A$ corresponds
to a partition of $U$ into pairwise disjoint decision classes denoted by $d^{-1}(v_d)$, $v_d \in V_d$.

The very initial model of decision table is the following:

Definition 1. Decision table $\mathbb{A} = (U, A, d)$ is called consistent iff for each $u \in U$
its indiscernibility class $[u]_A = \{u' \in U : Inf_A(u') = Inf_A(u)\}$ is included in
one of the decision classes $d^{-1}(v_d)$, $v_d \in V_d$.

In case of such consistency we classify new cases by analogy with those from
the universe, i.e., given some $new \notin U$ indiscernible from $u \in U$, we predict
that $new$ has decision value $d(u)$, since this is the only supported choice. If
$Inf_A(u) = w_A$ and $d(u) = v_d$, then we can say that our table generates the decision
rule $A = w_A \Rightarrow d = v_d$, stating that if any object is equal to $w_A$ on $A$, then
it is going to have decision value $v_d$. We can also introduce the boolean implication
$A \Rightarrow d$, which holds iff each $w_A \in V_A^U$ implies some $v_d \in V_d$ in the above sense.
One can easily see that decision table $\mathbb{A} = (U, A, d)$ is consistent iff boolean
implication $A \Rightarrow d$ is satisfied. In other words, the decision table is consistent iff
decision $d$ can be completely determined from the conditions.
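
As an illustration of the indiscernibility relation (1) and of Definition 1, the following minimal sketch groups objects by their information vectors and tests consistency. The toy table, the attribute names and the function names are assumptions made only for this example.

```python
from collections import defaultdict

def ind_classes(table, attrs):
    """Group object identifiers by their information vector Inf_B(u)."""
    classes = defaultdict(set)
    for u, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(u)
    return dict(classes)

def is_consistent(table, attrs, d):
    """True iff every indiscernibility class lies inside one decision class."""
    return all(len({table[u][d] for u in cls}) == 1
               for cls in ind_classes(table, attrs).values())

# Hypothetical toy decision table with conditions a, b and decision d.
table = {
    "u1": {"a": 0, "b": 1, "d": "yes"},
    "u2": {"a": 0, "b": 1, "d": "yes"},
    "u3": {"a": 1, "b": 0, "d": "no"},
    "u4": {"a": 1, "b": 1, "d": "yes"},
}

print(is_consistent(table, ["a", "b"], "d"))  # True
print(is_consistent(table, ["a"], "d"))       # False: u3 and u4 agree on a only
```
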

Definition 2. Given consistent decision table $\mathbb{A} = (U, A, d)$, subset $B \subseteq A$ is
called a decision implicant iff it satisfies boolean implication $B \Rightarrow d$. If $B$ is
minimal in the sense of inclusion, i.e. such that there is no proper subset
$C \subset B$ satisfying implication $C \Rightarrow d$, then we call $B$ a decision reduct for $\mathbb{A}$.

The above definition is connected with the fundamental question whether
we do need all conditions to determine the decision. Let us give an exemplary
argument for searching for decision implicants with a possibly small number
of attributes. We say that decision table $\mathbb{A}$ is applicable to an object $new \notin U$
under $B \subseteq A$ iff $new$ fits some indiscernibility class of $IND(B)$. Since for any
$w_A \in \times_{a \in A} V_a$ we have implication $w_A \in V_A^U \Rightarrow w_B \in V_B^U$ (where $w_B$
is the projection of $w_A$ onto $B$), we know that applicability to new objects is
potentially more probable under smaller $B \subseteq A$, provided that $B$ preserves the
precision of decision classification.

Proposition 1. Given consistent decision table $\mathbb{A} = (U, A, d)$, subset $B \subseteq A$ is
a decision implicant iff it discerns objects belonging to different decision classes.

As a corollary, we obtain that searching for minimal decision reducts is complex
enough to force us to base it on some heuristics. One can prove that the
problem of finding a decision implicant with a minimal number of attributes in a
consistent decision table is NP-hard. On the other hand, Proposition 1 enables us
to implement some random heuristics efficiently enough to obtain approximately
optimal solutions in a relatively short time (see e.g. [3]). In fact,
this advantage prompted us to verify the power of discernibility characteristics
with respect to more general rough set models.
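
The discernibility test of Proposition 1 is easy to express in code. The sketch below is not one of the heuristics mentioned above: it is an exhaustive search, exponential in the number of attributes, meant only to show how the test is used on a small, hypothetical table. All names are assumptions.

```python
from itertools import combinations

def discerns(rows, attrs, d):
    """Proposition 1 test: attrs separate every pair of objects with different decisions."""
    for x, y in combinations(rows, 2):
        if x[d] != y[d] and all(x[a] == y[a] for a in attrs):
            return False
    return True

def decision_reducts(rows, conds, d):
    """All minimal implicants, found by checking subsets in order of increasing size."""
    found = []
    for k in range(len(conds) + 1):
        for subset in combinations(conds, k):
            # Skip supersets of reducts already found: they cannot be minimal.
            if any(set(r) <= set(subset) for r in found):
                continue
            if discerns(rows, subset, d):
                found.append(subset)
    return found

# Hypothetical consistent decision table.
rows = [
    {"a": 0, "b": 1, "d": "yes"},
    {"a": 0, "b": 1, "d": "yes"},
    {"a": 1, "b": 0, "d": "no"},
    {"a": 1, "b": 1, "d": "yes"},
]

print(decision_reducts(rows, ["a", "b"], "d"))  # [('b',)]
```
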

3 Frequential reasoning

In the classical model of a consistent decision table $\mathbb{A} = (U, A, d)$, where each
indiscernibility class of $IND(A)$ is contained in one of the decision classes, preserving
information about the decision is equivalent to its complete determination. In
case of any inconsistencies, however, we must settle the way of representing
indeterministic knowledge. For any fixed $B \subseteq A$, let us introduce the rough membership
function $\mu_{d/B} : V_d \times V_B^U \to [0, 1]$, defined by the following formula

$\mu_{d/B}(v_d / w_B) = \frac{|w_B \wedge v_d|}{|w_B|}$ (2)

where $|w_B|$ is the number of objects with vector value $w_B$ on $B$, and $|w_B \wedge v_d|$
is the number of objects which additionally have the decision value equal to
$v_d$. We propose, given linear ordering $V_d = \langle v_1, \ldots, v_{|d|} \rangle$ (where $|d|$ denotes the
number of values possible for $d$), to consider the rough membership distribution
$\mu_{d/B} : V_B^U \to \Delta_{|d|-1}$ defined by

$\mu_{d/B}(w_B) = (\mu_{d/B}(v_1 / w_B), \ldots, \mu_{d/B}(v_{|d|} / w_B))$ (3)

where $\Delta_{|d|-1}$ denotes the $(|d|-1)$-dimensional simplex. Rough membership distributions
correspond to frequencies of putting indiscernibility classes of $IND(B)$
into particular decision classes. They are widely studied in rough sets as well
as in statistics (see e.g. [5], [6]). We have to remember that $\mu_{d/B}$ is just one of
the examples of a decision function which, for arbitrary $B \subseteq A$, specifies
conditional information about the decision attribute. On the other hand, however,
it expresses the whole knowledge about dependencies of the decision on conditions,
unless some additional information outside the decision table is provided
(compare with [10]).
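
Formulas (2) and (3) amount to counting decision values inside each indiscernibility class. A minimal sketch follows, using exact fractions; the inconsistent toy table and the function name are hypothetical and serve only as an illustration.

```python
from collections import Counter, defaultdict
from fractions import Fraction

def rough_membership(rows, attrs, d, decisions):
    """Map each vector value w_B to the tuple (mu(v_1/w_B), ..., mu(v_|d|/w_B))."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[tuple(row[a] for a in attrs)][row[d]] += 1
    return {w: tuple(Fraction(c[v], sum(c.values())) for v in decisions)
            for w, c in counts.items()}

# Hypothetical inconsistent decision table.
rows = [
    {"a": 0, "b": 0, "d": "yes"},
    {"a": 0, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "yes"},
    {"a": 0, "b": 1, "d": "no"},
    {"a": 1, "b": 0, "d": "yes"},
]

mu_ab = rough_membership(rows, ["a", "b"], "d", ("no", "yes"))
mu_a = rough_membership(rows, ["a"], "d", ("no", "yes"))
print(mu_ab[(0, 0)], mu_a[(0,)])
```

In this toy table the distribution attached to each vector on `{a, b}` coincides with the one attached to its projection on `{a}`, so dropping `b` loses no frequential information.
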

Given the above representation, one begins to handle uncertain decision rules
of the form $B = w_B \Rightarrow d = v_d$ with the probability $\mu_{d/B}(v_d / w_B)$, for any $B \subseteq A$,
$w_B \in V_B^U$ and $v_d \in V_d$. Thus, the notion of decision reduct is reformulated as follows.

Definition 3. Given decision table $\mathbb{A} = (U, A, d)$, subset $B \subseteq A$ is called a
$\mu$-implicant iff for each $w_A \in V_A^U$

$\mu_{d/B}(w_B) = \mu_{d/A}(w_A)$ (4)

where $w_B$ is the projection of $w_A$ onto $B$. If $B$ is minimal in the sense of inclusion,
satisfying (4) for all $w_A \in V_A^U$, we call it a $\mu$-decision reduct.

For consistent decision tables, the above is equivalent to Definition 2. Indeed,
in such a case, distribution $\mu_{d/A}(w_A)$, for each particular $w_A \in V_A^U$, corresponds
to a unique vertex of $\Delta_{|d|-1}$ and thus it must be also the case for $\mu_{d/B}(w_B)$
if (4) is satisfied. The characteristics provided by Proposition 1 for consistent tables
can be generalized as well.

Proposition 2. Given decision table $\mathbb{A} = (U, A, d)$, subset $B \subseteq A$ is a $\mu$-implicant
iff it is an implicant for the consistent decision table $\mathbb{A}^\mu = (U, A, \mu_{d/A})$,
or, equivalently, iff implication $B \Rightarrow \mu_{d/A}$ holds. This means that each two
$w_A^1, w_A^2 \in V_A^U$ such that $\mu_{d/A}(v_d / w_A^1) \neq \mu_{d/A}(v_d / w_A^2)$ for at least one $v_d \in V_d$
must be discerned by $B$.

Proof. We have to prove that equivalence $\mu_{d/B} = \mu_{d/A}$, understood in terms of
(4), holds iff $B \Rightarrow \mu_{d/A}$. ($\Rightarrow$) From left to right, by converse, let us assume
that there are $w_A^1, w_A^2 \in V_A^U$ such that $\mu_{d/A}(v_d / w_A^1) \neq \mu_{d/A}(v_d / w_A^2)$ for
some $v_d \in V_d$, with the same projection onto $B$, denoted by $w_B$. Then
condition (4) must fail for at least one of $i = 1, 2$, because if not, then
we would have $\mu_{d/B}(v_d / w_B) = \mu_{d/A}(v_d / w_A^i)$ for both $i = 1, 2$ and, as a
result, equality $\mu_{d/A}(v_d / w_A^1) = \mu_{d/A}(v_d / w_A^2)$, contradicting the above.
($\Leftarrow$) Now, let us assume that $B \Rightarrow \mu_{d/A}$ is satisfied. Let us consider arbitrary
$w_B \in V_B^U$ and the subset $V_A^U(w_B)$ of all $w_A \in V_A^U$ whose projection onto $B$ equals $w_B$. To finish the proof
it is enough to show that if for arbitrary $v_d \in V_d$ values $\mu_{d/A}(v_d / w_A)$ are
the same for all $w_A \in V_A^U(w_B)$, then they are equal to $\mu_{d/B}(v_d / w_B)$. Let
us enumerate members of $V_A^U(w_B)$ from $w_A^1$ to $w_A^n$ for appropriate $n$. Since
$\mu_{d/A}(v_d / w_A^i)$ is constant, we have that for each $i = 1, \ldots, n$ there is a real
number $k_i$ such that $|w_A^i \wedge v_d| = k_i \cdot |w_A^1 \wedge v_d|$ and $|w_A^i| = k_i \cdot |w_A^1|$, where $k_1 = 1$.
We can rewrite

$\mu_{d/B}(v_d / w_B) = \frac{|w_B \wedge v_d|}{|w_B|} = \frac{\sum_{w_A \in V_A^U(w_B)} |w_A \wedge v_d|}{\sum_{w_A \in V_A^U(w_B)} |w_A|} = \frac{\sum_{i=1,\ldots,n} (k_i \cdot |w_A^1 \wedge v_d|)}{\sum_{i=1,\ldots,n} (k_i \cdot |w_A^1|)} = \frac{|w_A^1 \wedge v_d|}{|w_A^1|}$ (5)

So $\mu_{d/B}(v_d / w_B)$ is equal to $\mu_{d/A}(v_d / w_A^1)$ and thus to $\mu_{d/A}(v_d / w_A^i)$ for any
other $i = 2, \ldots, n$.

The above result can be interpreted in both pessimistic and optimistic ways.
The pessimistic one is due to the corollary that for any inconsistent decision table
the problem of finding a minimal $\mu$-implicant is NP-hard. The optimistic interpretation
is that we could not expect lower time complexity, since the above problem
had already turned out to be NP-hard for the special case of consistent decision
tables. We should rather be happy that complexity does not increase; indeed, we
managed to reduce the search problem for inconsistent tables to that for consistent
ones. The cost of such a reformulation, for each particular $\mathbb{A}$, is negligible with
respect to further computations. Moreover, space complexity remains the same
and, finally, we can apply algorithmic tools based on discernibility to $\mathbb{A}^\mu$.
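
The equivalence stated in Proposition 2 can be illustrated by comparing the direct test of Definition 3 with the discernibility test on the table relabelled by rough membership distributions. The sketch below is an illustration under assumed names and a hypothetical toy table, not an implementation taken from the paper.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from itertools import combinations

def distributions(rows, attrs, d, decisions):
    """Rough membership distribution (3) for every vector value occurring on attrs."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[tuple(row[a] for a in attrs)][row[d]] += 1
    return {w: tuple(Fraction(c[v], sum(c.values())) for v in decisions)
            for w, c in counts.items()}

def implicant_direct(rows, conds, subset, d, decisions):
    """Definition 3: the distribution under subset equals the one under all conditions."""
    full = distributions(rows, conds, d, decisions)
    part = distributions(rows, subset, d, decisions)
    idx = [conds.index(a) for a in subset]
    return all(part[tuple(w[i] for i in idx)] == mu for w, mu in full.items())

def implicant_by_discernibility(rows, conds, subset, d, decisions):
    """Proposition 2: subset discerns all vectors with different distributions."""
    full = distributions(rows, conds, d, decisions)
    idx = [conds.index(a) for a in subset]
    for (w1, m1), (w2, m2) in combinations(full.items(), 2):
        if m1 != m2 and all(w1[i] == w2[i] for i in idx):
            return False
    return True

# Hypothetical inconsistent decision table.
rows = [
    {"a": 0, "b": 0, "d": "yes"},
    {"a": 0, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "yes"},
    {"a": 0, "b": 1, "d": "no"},
    {"a": 1, "b": 0, "d": "yes"},
]
conds, dec = ["a", "b"], ("no", "yes")
for subset in ([], ["a"], ["b"], ["a", "b"]):
    print(subset,
          implicant_direct(rows, conds, subset, "d", dec),
          implicant_by_discernibility(rows, conds, subset, "d", dec))
```

Both tests agree on every subset of this table, and `["a"]` is the only minimal subset passing them.
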



4 A generalization of the frequential model 

So far, we assumed that the source of our knowledge consists of objects which
had deterministically defined attribute values. In many applications, however, we
cannot observe concrete values of attributes from $A \cup \{d\}$. Instead, for each $u \in U$
and $a \in A \cup \{d\}$ we have a probability distribution, denoted here by
$u_a = (u_a(v_1), \ldots, u_a(v_{|a|})) \in \Delta_{|a|-1}$.

From now on we will write $\mathbb{A} = (U^\Delta, A, d)$ while talking about decision tables
with objects remaining uncertain in the above sense.

One must realize that once we begin to deal with uncertain objects inside the
universe, we cannot expect certain values from new objects occurring at the input
of the reasoning process. Thus each $new$ takes the form of a vector of simplex elements
$new_a$, $a \in A$, where $new_a(v_a)$ is the chance that it has value $v_a$ on $a$. To
obtain the probability that a given $new$ has decision value $v_d \in V_d$, by referring it to
$\mathbb{A}$ with respect to $B \subseteq A$, we first compute the distribution concerning its possible
vector values on $B$ and then combine it with estimated conditional probabilities
of $v_d$ under particular elements of $\times_{a \in B} V_a$. Starting from the distribution over vector
values, let us put

$new_B(w_B) = \prod_{a \in B} new_a(w_B^{\downarrow a})$ (6)

as the expected chance that $new$ has $w_B$ on $B$, where $w_B^{\downarrow a}$ denotes the value of
$w_B$ on $a$. By $new_B^* \subseteq \times_{a \in B} V_a$ we denote
the subset of all vector values such that $new_B(w_B) > 0$. Further, in order to
extract conditional information contained in $\mathbb{A} = (U^\Delta, A, d)$, let us introduce an
expected membership function $\mu_{d/B} : V_d \times V_B^* \to [0, 1]$ defined by

$\mu_{d/B}(v_d / w_B) = \frac{\mu_{B,d}(w_B, v_d)}{\mu_B(w_B)}$ (7)

where

$\mu_{B,d}(w_B, v_d) = \sum_{u \in U^\Delta} \prod_{a \in B \cup \{d\}} u_a((w_B, v_d)^{\downarrow a})$ (8)

and $V_B^* \subseteq \times_{a \in B} V_a$ is the set of all vector values $w_B$ such that $\mu_B(w_B)$ (defined
analogously to (8)) is greater than zero. Formula (7) generalizes the rough
membership function, since it is equivalent to (2) in case of deterministically
defined attribute values over the universe. Obviously, in case of both (6) and
(7), one could wonder whether we should be allowed to use multiplications, as
if assuming that attributes from $B$ correspond to pairwise independent random
variables. However, on the other hand, this is the only way to combine probabilistic
knowledge within a product space, unless some additional information
concerning dependencies among attributes is given. We would like to use the
following estimator

$\mu_{d/B}(v_d / new) = \sum_{w_B \in V_B^*} new_B(w_B) \cdot \mu_{d/B}(v_d / w_B)$ (9)

corresponding to the probability of $v_d$ for $new$ under $B$. It is analogous to the formula
for the total probability, where, under fixed $w_B$, $\mu_{d/B}(v_d / w_B)$ is the chance that
$d = v_d$ and $new_B(w_B)$ corresponds to the chance that $new$ has particular $w_B$
on $B$. Let us generalize the notion of applicability introduced in Section 2 as
follows.

Definition 4. We say that decision table $\mathbb{A} = (U^\Delta, A, d)$ is applicable to a new
object $new \notin U^\Delta$ with respect to $B \subseteq A$ iff there is inclusion $new_B^* \subseteq V_B^*$.

From now on we would like, for fixed $B \subseteq A$, to reason only about new
objects satisfying the above inclusion. Obviously, one could argue that such a
restriction is too strong, since it would be better to make a decision based on at least
a part of the information, corresponding to the subdomain of possible vector values
from $new_B^* \cap V_B^*$, than to do nothing. However, we must remember that such
an approach may lead to wrong predictions: here we are trying to deal with
probabilities as concrete real numbers and thus, for any $new$, the loss of any
information may result in completely irrelevant conclusions.

Considering only those new objects which satisfy the conditions of Definition
4 for given $B \subseteq A$ has one additional advantage. Let us note that to obtain
a well defined distribution estimator $\mu_{d/B}(new)$, defined in a standard way, we
have to check whether $\sum_{v_d \in V_d} \mu_{d/B}(v_d / new) = 1$ and normalize formula (9) if
it is not the case. It turns out that such equality is satisfied iff $new_B^* \subseteq V_B^*$. This
fact shows that indeed only in case of objects to which $\mathbb{A}$ is applicable there is
no loss of information along the reasoning procedure, i.e. the whole knowledge
about a new object can be used for its probabilistic classification.
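
The estimator (9) and the applicability test of Definition 4 can be sketched as follows. This is only an illustration under stated assumptions: each uncertain object is represented as a dictionary mapping an attribute to a dictionary of value probabilities, products over attributes are used exactly as in (6) and (8), and the tiny universe, the domains and all function names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product
from math import prod

def vectors(domains, attrs):
    """All vector values from the product of the attribute domains."""
    return product(*(domains[a] for a in attrs))

def weight(obj, attrs, w):
    """Product of obj's probabilities of taking value w[i] on attrs[i], as in (6) and (8)."""
    return prod((obj[a].get(v, F(0)) for a, v in zip(attrs, w)), start=F(1))

def expected_membership(universe, domains, attrs, d):
    """Formula (7) for every w_B with mu_B(w_B) > 0, i.e. for every w_B in V_B^*."""
    result = {}
    for w in vectors(domains, attrs):
        mu_b = sum(weight(u, attrs, w) for u in universe)
        if mu_b > 0:
            result[w] = {v: sum(weight(u, attrs, w) * u[d].get(v, F(0))
                                for u in universe) / mu_b
                         for v in domains[d]}
    return result

def classify(universe, domains, attrs, d, new):
    """Return (applicable, estimate): Definition 4 and the unnormalized formula (9)."""
    mu = expected_membership(universe, domains, attrs, d)
    support = {w: weight(new, attrs, w) for w in vectors(domains, attrs)}
    support = {w: p for w, p in support.items() if p > 0}
    applicable = set(support) <= set(mu)
    estimate = {v: sum(p * mu[w][v] for w, p in support.items() if w in mu)
                for v in domains[d]}
    return applicable, estimate

# Hypothetical universe with one condition attribute a and decision d.
domains = {"a": (0, 1, 2), "d": ("no", "yes")}
universe = [
    {"a": {0: F(1)}, "d": {"yes": F(1)}},
    {"a": {0: F(1, 2), 1: F(1, 2)}, "d": {"no": F(1)}},
]
ok, est = classify(universe, domains, ["a"], "d", {"a": {0: F(1, 4), 1: F(3, 4)}})
bad, est2 = classify(universe, domains, ["a"], "d", {"a": {0: F(1, 2), 2: F(1, 2)}})
print(ok, est, sum(est.values()))
print(bad, est2, sum(est2.values()))
```

For the first new object the table is applicable and the estimates sum to one. The second object puts half of its mass on a value never observed in the universe, so the table is not applicable and the estimates sum to one half only.
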

Definition 5. Given decision table $\mathbb{A} = (U^\Delta, A, d)$, subset $B \subseteq A$ is called an
expected $\mu$-implicant iff for each $new \notin U^\Delta$ such that $new_A^* \subseteq V_A^*$ we have
equality

$\mu_{d/B}(new) = \mu_{d/A}(new)$ (10)

If $B$ is minimal in the sense of inclusion, satisfying (10), we call it an expected
$\mu$-decision reduct.

Given the generalization of the notion of applicability onto decision tables
with uncertain objects, we can repeat the argument for searching for minimal
implicants from Section 2. Inclusion $new_A^* \subseteq V_A^*$ in the above definition means that
we can be initially interested only in new cases for which the given decision table is
applicable with respect to the whole set of conditional attributes. For fixed $new$,
it is easy to prove that this inclusion implies $new_B^* \subseteq V_B^*$. It lets us claim that,
just like before, handling a smaller subset of conditions gives a chance of applying
gathered information to a potentially larger number of new objects. Let us
conclude with the following discernibility characteristics, generalizing Proposition 2
from Section 3.

Proposition 3. Given decision table $\mathbb{A} = (U^\Delta, A, d)$, subset $B \subseteq A$ is an
expected $\mu$-implicant iff it satisfies boolean implication $B \Rightarrow \mu_{d/A}$, i.e. discerns all
pairs $w_A^1, w_A^2 \in V_A^*$ with different uncertain membership distributions.

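Proof. ($\Rightarrow$) Just like before, let us start with proving that the condition connected
with (10) implies $B \Rightarrow \mu_{d/A}$. By converse, let us assume that there are
$w_A^1, w_A^2 \in V_A^*$ such that $\mu_{d/A}(v_d / w_A^1) \neq \mu_{d/A}(v_d / w_A^2)$ for some $v_d \in V_d$,
with the same projection onto $B$, denoted by $w_B \in V_B^*$. Then, for at
least one $i = 1, 2$ we must have $\mu_{d/B}(v_d / w_B) \neq \mu_{d/A}(v_d / w_A^i)$. It remains
to take as a counterexample an object $new$ indicating $w_A^i$, i.e. such that
$new_A(w_A^i) = 1$, for which both equalities $\mu_{d/B}(v_d / new) = \mu_{d/B}(v_d / w_B)$
and $\mu_{d/A}(v_d / new) = \mu_{d/A}(v_d / w_A^i)$ are satisfied. ($\Leftarrow$) For the opposite direction,
let us just note that computations analogous to (5) result in the fact that
$B \Rightarrow \mu_{d/A}$ implies equality $\mu_{d/B}(v_d / w_B) = \mu_{d/A}(v_d / w_A)$, for any $w_B \in V_B^*$ and
$w_A \in V_A^*$ belonging to $V_A^*(w_B)$, the set of all $w_A \in V_A^*$ whose projection onto $B$ equals $w_B$. Further, let us
consider arbitrary $new$ such that $new_A^* \subseteq V_A^*$. For any $B \subseteq A$ and $v_d \in V_d$
we can write

$\mu_{d/A}(v_d / new) = \sum_{w_B \in V_B^*} \sum_{w_A \in V_A^*(w_B)} new_A(w_A) \cdot \mu_{d/A}(v_d / w_A)$ (11)

However, we already know that if $B \Rightarrow \mu_{d/A}$ then for any fixed $w_B \in V_B^*$
values $\mu_{d/A}(v_d / w_A)$ are equal to $\mu_{d/B}(v_d / w_B)$ for all $w_A \in V_A^*(w_B)$. Thus
we obtain

$\mu_{d/A}(v_d / new) = \sum_{w_B \in V_B^*} \left( \sum_{w_A \in V_A^*(w_B)} new_A(w_A) \right) \mu_{d/B}(v_d / w_B)$ (12)

Now it is enough to realize that inclusion $new_A^* \subseteq V_A^*$ implies that we can
replace $\sum_{w_A \in V_A^*(w_B)} new_A(w_A)$ with $new_B(w_B)$, which finally leads to the
equality $\mu_{d/A}(v_d / new) = \mu_{d/B}(v_d / new)$.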

5 Conclusions

We presented a rough set approach to searching for minimal sets of features 
classifying objects in terms of rough membership information. We also proposed 
a reasoning model for dealing with objects indeterministically defined on at- 
tributes. Due to discernibility characteristics for minimal frequential implicant 
search problem, attributes preserving frequential decision information can be 
found using algorithms implemented for consistent decision tables, where com- 
plexity remains similar. Still, requirements corresponding to such preservation 
are too rigorous in practice and thus we may need some approximations (compare 
with [10]). Another direction for further research is to reason about indeterministic
cases under “almost” applicable information, by introducing a kind of rough
inclusion ([1]) into the definition of applicability.

1 Introduction

As pointed out by Greco, Matarazzo and Slowinski [1], the original rough set approach
does not consider criteria, i.e. attributes with ordered domains. However,
in many real problems the ordering properties of the considered attributes may
play an important role. E.g. in a bankruptcy evaluation problem, if firm A has
a low value of the debt ratio (Total debt/Total assets) and firm B has a large
value of the same ratio, within the original rough set approach the two firms are
just discernible, but no preference is established between the two with respect
to the attribute “debt ratio”. Instead, from a decisional point of view, it would
be better to consider firm A as preferred to firm B, and not simply “discernible”,
with respect to the attribute in question.

Motivated by the previous considerations, Greco, Matarazzo and Slowinski [2]
proposed a new rough set approach to take into account the ordering properties
of criteria. Similarly to the original rough set analysis, the proposed approach is
based on approximations of a partition of objects into some pre-defined categories.
However, differently from the original approach, the categories are ordered from
the best to the worst and the approximations are built using dominance relations,
which are specific order binary relations, instead of indiscernibility relations, which are
equivalence relations. The considered dominance relations are built on the basis
of the information supplied by condition attributes which are all criteria. In this
paper we generalize this approach by considering a set of condition attributes which
are not all criteria.

The paper is organized in the following way. In the second section, the main
concepts of the rough approximation based on criteria and attributes are introduced.
In Section 3 we apply the proposed approach to a didactic example to
compare the results with the original rough set approach. The final section presents
conclusions.

2 Multicriteria and multiattribute rough approximation 

As usual, by an information table we understand the 4-tuple $S = \langle U, Q, V, f \rangle$,
where $U$ is a finite set of objects, $Q$ is a finite set of attributes, $V = \bigcup_{q \in Q} V_q$
and $V_q$ is a domain of the attribute $q$, and $f : U \times Q \to V$ is a total function
such that $f(x, q) \in V_q$ for every $q \in Q$, $x \in U$, called an information function
(cf. Pawlak [4]).

Moreover, an information table can be seen as a decision table assuming the
set of attributes $Q = C \cup D$ and $C \cap D = \emptyset$, where set $C$ contains so called
condition attributes, and $D$, decision attributes.

In general, the notion of attribute differs from that of criterion, because the
domain (scale) of a criterion has to be ordered according to a decreasing or
increasing preference, while the domain of the attribute does not have to be
ordered. We will use the notion of criterion only when the preferential ordering
of the attribute domain is important in a given context. Formally, for each $q \in C$
which is a criterion there exists an outranking relation (Roy [6]) $S_q$ on $U$ such
that $x S_q y$ means “$x$ is at least as good as $y$ with respect to attribute $q$”. We
suppose that $S_q$ is a total preorder, i.e. a strongly complete and transitive binary
relation on $U$. Instead, for each attribute $q \in C$ which is not a criterion, there
exists an indiscernibility relation $I_q$ on $U$ which, as usual in rough sets theory,
is an equivalence binary relation, i.e. reflexive, symmetric and transitive. We
denote by $C^{>}$ the subset of attributes being criteria in $C$ and by $C^{=}$ the subset
of attributes which are not criteria, such that $C^{>} \cup C^{=} = C$ and $C^{>} \cap C^{=} = \emptyset$.
Moreover, for each $P \subseteq C$ we denote by $P^{>}$ the set of criteria contained in $P$, i.e.
$P^{>} = P \cap C^{>}$, and by $P^{=}$ the set of attributes which are not criteria contained
in $P$, i.e. $P^{=} = P \cap C^{=}$.

Let $R_P$ be a reflexive and transitive binary relation on $U$, i.e. $R_P$ is a partial
preorder on $U$, defined on the basis of the information given by the attributes in
$P \subseteq C$. More precisely, for each $P \subseteq C$ we can define $R_P$ as follows: $\forall x, y \in U$,
$x R_P y$ if $x S_q y$ for each $q \in P^{>}$ (i.e. $x$ outranks $y$ with respect to all the criteria
in $P$) and $x I_q y$ for each $q \in P^{=}$ (i.e. $x$ is indiscernible with $y$ with respect to all
the attributes which are not criteria in $P$). If $P \subseteq C^{>}$ (i.e. if all the attributes
in $P$ are criteria) and $x R_P y$, then $x$ outranks $y$ with respect to each $q \in P$ and
therefore we can say that $x$ dominates $y$ with respect to $P$. Let us observe that in
general $\forall x, y \in U$ and $\forall P \subseteq C$, $x R_P y$ if and only if $x$ dominates $y$ with respect
to $P^{>}$ and $x$ is indiscernible with $y$ with respect to $P^{=}$.

Furthermore let $Cl = \{Cl_t, t \in T\}$, $T = \{1, \ldots, n\}$, be a set of classes of
$U$, such that each $x \in U$ belongs to one and only one $Cl_t \in Cl$. We suppose
that $\forall r, s \in T$ such that $r > s$, the elements of $Cl_r$ are preferred (strictly or
weakly (Roy [6])) to the elements of $Cl_s$. More formally, if $S$ is a comprehensive
outranking relation on $U$, i.e. if $\forall x, y \in U$ $xSy$ means “$x$ is at least as good as
$y$”, we suppose

$[x \in Cl_r, y \in Cl_s, r > s] \Rightarrow [xSy \text{ and not } ySx]$.

In simple words the classes $Cl$ represent a comprehensive evaluation of the objects
in $U$: the worst objects are in $Cl_1$, the best objects are in $Cl_n$, the other
objects belong to the remaining classes $Cl_r$, according to an evaluation improving
with the index $r \in T$. E.g. considering a credit evaluation problem we can
have $T = \{1, 2, 3\}$, $Cl = \{Cl_1, Cl_2, Cl_3\}$ and $Cl_1$ represents the class of the
“unacceptable” firms, $Cl_2$ represents the class of “uncertain” firms, $Cl_3$ represents
the class of “acceptable” firms.

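Starting from the classes in $Cl$, we can define the following sets:

$Cl_t^{\geq} = \bigcup_{s \geq t} Cl_s$,

$Cl_t^{\leq} = \bigcup_{s \leq t} Cl_s$.

Let us remark that $Cl_1^{\geq} = Cl_n^{\leq} = U$, $Cl_n^{\geq} = Cl_n$ and $Cl_1^{\leq} = Cl_1$.
Furthermore $\forall t = 2, \ldots, n$ we have:

$Cl_{t-1}^{\leq} = U - Cl_t^{\geq}$ (1)

and

$Cl_t^{\geq} = U - Cl_{t-1}^{\leq}$ (2)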


For each $P \subseteq C$, let be

$R_P^{+}(x) = \{y \in U : y R_P x\}$,

$R_P^{-}(x) = \{y \in U : x R_P y\}$.

Let us observe that, given $x \in U$, $R_P^{+}(x)$ represents the set of all the objects
$y \in U$ which dominate $x$ with respect to $P^{>}$ (i.e. the criteria of $P$) and are
indiscernible with $x$ with respect to $P^{=}$ (i.e. the attributes of $P$). Analogously
$R_P^{-}(x)$ represents the set of all the objects $y \in U$ which are dominated by $x$ with
respect to $P^{>}$ and are indiscernible with $x$ with respect to $P^{=}$.

We say that, with respect to $P \subseteq C$ and $t \in T$, $x \in U$ belongs to $Cl_t^{\geq}$ without
any ambiguity if $x \in Cl_t^{\geq}$ and $y \in Cl_t^{\geq}$ for all the objects $y \in U$ dominating $x$
with respect to $P^{>}$ and indiscernible with $x$ with respect to $P^{=}$.

Formally, remembering the reflexivity of $R_P$, we can say that $x \in U$ belongs
to $Cl_t^{\geq}$ without any ambiguity if $R_P^{+}(x) \subseteq Cl_t^{\geq}$. Furthermore we say that, with
respect to $P \subseteq C$ and $t \in T$, $y \in U$ could belong to $Cl_t^{\geq}$ if there exists at least one
object $x \in Cl_t^{\geq}$ such that $y$ dominates $x$ with respect to $P^{>}$ and $y$ is indiscernible
with $x$ with respect to $P^{=}$, i.e. $y \in R_P^{+}(x)$. Our definitions of lower and upper
approximation are based on the previous ideas. Thus, with respect to $P \subseteq C$,
the set of all the objects belonging to $Cl_t^{\geq}$ without any ambiguity constitutes the
lower approximation of $Cl_t^{\geq}$, while the set of all the objects which could belong
to $Cl_t^{\geq}$ constitutes the upper approximation of $Cl_t^{\geq}$.

Formally, $\forall t \in T$ and $\forall P \subseteq C$ we define the lower approximation of $Cl_t^{\geq}$
with respect to $P$, denoted by $\underline{P}Cl_t^{\geq}$, and the upper approximation of $Cl_t^{\geq}$ with
respect to $P$, denoted by $\overline{P}Cl_t^{\geq}$, as:

$\underline{P}Cl_t^{\geq} = \{x \in U : R_P^{+}(x) \subseteq Cl_t^{\geq}\}$,

$\overline{P}Cl_t^{\geq} = \bigcup_{x \in Cl_t^{\geq}} R_P^{+}(x)$.

We say that, with respect to $P \subseteq C$ and $t \in T$, $x \in U$ belongs to $Cl_t^{\leq}$ without
any ambiguity if $x \in Cl_t^{\leq}$ and $y \in Cl_t^{\leq}$ for all the objects $y \in U$ dominated by $x$
with respect to $P^{>}$ and indiscernible with $x$ with respect to $P^{=}$.

Formally, remembering the reflexivity of $R_P$, we can say that $x \in U$ belongs
to $Cl_t^{\leq}$ without any ambiguity if $R_P^{-}(x) \subseteq Cl_t^{\leq}$. Furthermore we say that with
respect to $P \subseteq C$, $y \in U$ could belong to $Cl_t^{\leq}$ if there exists at least one object
$x \in Cl_t^{\leq}$ such that $x$ dominates $y$ with respect to $P^{>}$ and $y$ is indiscernible with
$x$ with respect to $P^{=}$, i.e. $y \in R_P^{-}(x)$. Thus, with respect to $P \subseteq C$, the set
of all the objects belonging to $Cl_t^{\leq}$ without any ambiguity constitutes the lower
approximation of $Cl_t^{\leq}$, while the set of all the objects which could belong to $Cl_t^{\leq}$
constitutes the upper approximation of $Cl_t^{\leq}$.

Formally, $\forall t \in T$ and $\forall P \subseteq C$, we define the lower approximation of $Cl_t^{\leq}$
with respect to $P$, denoted by $\underline{P}Cl_t^{\leq}$, and the upper approximation of $Cl_t^{\leq}$ with
respect to $P$, denoted by $\overline{P}Cl_t^{\leq}$, as:

$\underline{P}Cl_t^{\leq} = \{x \in U : R_P^{-}(x) \subseteq Cl_t^{\leq}\}$,

$\overline{P}Cl_t^{\leq} = \bigcup_{x \in Cl_t^{\leq}} R_P^{-}(x)$.

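The $P$-boundary (doubtful region) of $Cl_t^{\geq}$ and $Cl_t^{\leq}$ are respectively defined as

$Bn_P(Cl_t^{\geq}) = \overline{P}Cl_t^{\geq} - \underline{P}Cl_t^{\geq}$,
$Bn_P(Cl_t^{\leq}) = \overline{P}Cl_t^{\leq} - \underline{P}Cl_t^{\leq}$.

$\forall t \in T$ and $\forall P \subseteq C$ we define the accuracy of the approximation of $Cl_t^{\geq}$ and
$Cl_t^{\leq}$ as the ratios:

$\alpha_P(Cl_t^{\geq}) = \frac{card(\underline{P}Cl_t^{\geq})}{card(\overline{P}Cl_t^{\geq})}$

and

$\alpha_P(Cl_t^{\leq}) = \frac{card(\underline{P}Cl_t^{\leq})}{card(\overline{P}Cl_t^{\leq})}$,

respectively. The coefficient

$\gamma_P(Cl) = \frac{card\left(U - \left(\left(\bigcup_{t \in T} Bn_P(Cl_t^{\geq})\right) \cup \left(\bigcup_{t \in T} Bn_P(Cl_t^{\leq})\right)\right)\right)}{card(U)}$

is called the quality of approximation of partition $Cl$ by set of attributes $P$,
or in short, quality of classification. It expresses the ratio of all $P$-correctly
classified objects to all objects in the table. Each minimal subset $P \subseteq C$ such that
$\gamma_P(Cl) = \gamma_C(Cl)$ is called a reduct of $Cl$ and denoted by $RED_{Cl}$. Let us remark
that an information table can have more than one reduct. The intersection of all
reducts is called the core and denoted by $CORE_{Cl}$.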

3 An example

The following example (based on a previous example proposed by Pawlak [5])
illustrates the concepts introduced above. In Table 1, twelve warehouses are
described by means of five attributes:

— $A_1$, capacity of the sales staff,

— $A_2$, perceived quality of goods,

— $A_3$, high traffic location,

— $A_4$, geographical region,

— $A_5$, warehouse profit or loss.

In fact, $A_1$, $A_2$ and $A_3$ are criteria, because their domains are ordered, $A_4$
is an attribute, whose domain is not ordered, and $A_5$ is a decision attribute,
defining two ordered decision classes. In more detail, we have that

— with respect to $A_1$ “high” is better than “medium” and “medium” is better
than “low”,

— with respect to $A_2$ “good” is better than “medium”,

— with respect to $A_3$ “yes” is better than “no”,

— with respect to $A_5$ “profit” is better than “loss”.


Table 1. Example of an information table.

Warehouse  $A_1$    $A_2$    $A_3$  $A_4$  $A_5$
1          High     Good     no     A      Profit
2          Medium   Good     no     A      Loss
3          Medium   Good     no     A      Profit
4          Low      Medium   no     A      Loss
5          Medium   Medium   yes    A      Loss
6          High     Medium   yes    A      Profit
7          Medium   Medium   no     A      Profit
8          High     Good     no     B      Profit
9          Medium   Good     no     B      Profit
10         Low      Medium   no     B      Loss
11         Medium   Medium   yes    B      Profit
12         High     Medium   yes    B      Profit

3.1 The results from classical rough set approach 

By means of the classical rough set approach we approximate the class $Cl_1$ of
the warehouses making loss and the class $Cl_2$ of the warehouses making profit.
It is clear that $C = \{A_1, A_2, A_3, A_4\}$ and $D = \{A_5\}$. The $C$-lower approximations,
the $C$-upper approximations and the $C$-boundaries of sets $Cl_1$ and $Cl_2$ are
respectively:
$\underline{C}Cl_1 = \{4, 5, 10\}$, $\overline{C}Cl_1 = \{2, 3, 4, 5, 10\}$, $Bn_C(Cl_1) = \{2, 3\}$,
$\underline{C}Cl_2 = \{1, 6, 7, 8, 9, 11, 12\}$, $\overline{C}Cl_2 = \{1, 2, 3, 6, 7, 8, 9, 11, 12\}$, $Bn_C(Cl_2) = \{2, 3\}$.
Therefore the accuracy of the approximation is 0.6 for the class of warehouses making loss and
0.78 for the class of warehouses making profit and the quality of classification
is equal to 0.83. There is only one reduct which is also the core, i.e. $Red(C) =
Core(C) = \{A_1, A_2, A_3, A_4\}$.
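
The classical approximations reported above can be recomputed directly from Table 1. The following minimal sketch encodes the table as a dictionary; the variable and function names are assumptions introduced only for this illustration.

```python
from collections import defaultdict

# Table 1: warehouse -> (A1, A2, A3, A4, A5)
ROWS = {
    1: ("high", "good", "no", "A", "profit"),
    2: ("medium", "good", "no", "A", "loss"),
    3: ("medium", "good", "no", "A", "profit"),
    4: ("low", "medium", "no", "A", "loss"),
    5: ("medium", "medium", "yes", "A", "loss"),
    6: ("high", "medium", "yes", "A", "profit"),
    7: ("medium", "medium", "no", "A", "profit"),
    8: ("high", "good", "no", "B", "profit"),
    9: ("medium", "good", "no", "B", "profit"),
    10: ("low", "medium", "no", "B", "loss"),
    11: ("medium", "medium", "yes", "B", "profit"),
    12: ("high", "medium", "yes", "B", "profit"),
}

def approximations(target):
    """Classical lower and upper approximation of target with respect to A1..A4."""
    blocks = defaultdict(set)
    for x, row in ROWS.items():
        blocks[row[:4]].add(x)  # indiscernibility classes on the condition attributes
    lower = set().union(*(b for b in blocks.values() if b <= target))
    upper = set().union(*(b for b in blocks.values() if b & target))
    return lower, upper

loss = {x for x, row in ROWS.items() if row[4] == "loss"}
profit = set(ROWS) - loss
low1, up1 = approximations(loss)
low2, up2 = approximations(profit)
quality = (len(low1) + len(low2)) / len(ROWS)

print(sorted(low1), sorted(up1), round(len(low1) / len(up1), 2))
print(sorted(low2), sorted(up2), round(len(low2) / len(up2), 2))
print(round(quality, 2))  # 0.83
```
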

Using the algorithm LERS (Grzymala-Busse [3]) the following set of decision
rules is obtained from the considered decision table (Table 1) (within parentheses
there are the objects supporting the corresponding rules):

1. if $f(x, A_1)$ = high, then $x \in Cl_2$ (1, 6, 8, 12)

2. if $f(x, A_1)$ = medium and $f(x, A_4)$ = B, then $x \in Cl_2$ (9, 11)

3. if $f(x, A_1)$ = medium and $f(x, A_2)$ = medium and $f(x, A_3)$ = no, then $x \in Cl_2$ (7)

4. if $f(x, A_1)$ = medium and $f(x, A_2)$ = good and $f(x, A_4)$ = A, then $x \in Cl_1$
or $x \in Cl_2$ (2, 3)

5. if $f(x, A_1)$ = low, then $x \in Cl_1$ (4, 10)

6. if $f(x, A_1)$ = medium and $f(x, A_3)$ = yes and $f(x, A_4)$ = A, then $x \in Cl_1$ (5)



3.2 The results from approximations by dominance and 
indiscernibility relations 

With this approach we approximate the class $Cl_1^{\leq}$ of the warehouses at most
making loss and the class $Cl_2^{\geq}$ of the warehouses at least making profit. Since
only two classes are considered, we have $Cl_1^{\leq} = Cl_1$ and $Cl_2^{\geq} = Cl_2$. When a
larger number of classes is considered these equalities are not satisfied.

The $C$-lower approximations, the $C$-upper approximations and the $C$-boundaries
of sets $Cl_1^{\leq}$ and $Cl_2^{\geq}$ are respectively:
$\underline{C}Cl_1^{\leq} = \{4, 10\}$, $\overline{C}Cl_1^{\leq} = \{2, 3, 4, 5, 7, 10\}$, $Bn_C(Cl_1^{\leq}) = \{2, 3, 5, 7\}$,
$\underline{C}Cl_2^{\geq} = \{1, 6, 8, 9, 11, 12\}$, $\overline{C}Cl_2^{\geq} = \{1, 2, 3, 5, 6, 7, 8, 9, 11, 12\}$, $Bn_C(Cl_2^{\geq}) = \{2, 3, 5, 7\}$.
Therefore, the accuracy of the approximation is 0.33 for $Cl_1^{\leq}$ and 0.6 for $Cl_2^{\geq}$,
while the quality of classification is
equal to 0.67. There is only one reduct, which is also the core, i.e. $RED_{Cl}(C) =
CORE_{Cl}(C) = \{A_1, A_4\}$.
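
The approximations by dominance and indiscernibility can be recomputed from Table 1 in the same spirit. In the sketch below the numeric ranks assigned to the values of the criteria only encode the preference orders stated in the example; the variable and function names are assumptions introduced for this illustration.

```python
# Table 1: warehouse -> (A1, A2, A3, A4, A5)
ROWS = {
    1: ("high", "good", "no", "A", "profit"),
    2: ("medium", "good", "no", "A", "loss"),
    3: ("medium", "good", "no", "A", "profit"),
    4: ("low", "medium", "no", "A", "loss"),
    5: ("medium", "medium", "yes", "A", "loss"),
    6: ("high", "medium", "yes", "A", "profit"),
    7: ("medium", "medium", "no", "A", "profit"),
    8: ("high", "good", "no", "B", "profit"),
    9: ("medium", "good", "no", "B", "profit"),
    10: ("low", "medium", "no", "B", "loss"),
    11: ("medium", "medium", "yes", "B", "profit"),
    12: ("high", "medium", "yes", "B", "profit"),
}
# Preference orders of the criteria A1, A2, A3 (a larger rank is better).
RANK = [{"low": 0, "medium": 1, "high": 2}, {"medium": 0, "good": 1}, {"no": 0, "yes": 1}]
U = set(ROWS)

def related(y, x):
    """y R_C x: y is at least as good as x on A1..A3 and equal to x on A4."""
    ry, rx = ROWS[y], ROWS[x]
    return ry[3] == rx[3] and all(r[ry[i]] >= r[rx[i]] for i, r in enumerate(RANK))

def r_plus(x):
    return {y for y in U if related(y, x)}

def r_minus(x):
    return {y for y in U if related(x, y)}

loss = {x for x in U if ROWS[x][4] == "loss"}      # Cl_1, downward union for t = 1
profit = U - loss                                  # Cl_2, upward union for t = 2

lower_up = {x for x in U if r_plus(x) <= profit}
upper_up = set().union(*(r_plus(x) for x in profit))
lower_down = {x for x in U if r_minus(x) <= loss}
upper_down = set().union(*(r_minus(x) for x in loss))
boundary = (upper_up - lower_up) | (upper_down - lower_down)
quality = len(U - boundary) / len(U)

print(sorted(lower_down), sorted(upper_down))
print(sorted(lower_up), sorted(upper_up))
print(sorted(boundary), round(quality, 2))
```
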

The following minimal set of decision rules can be obtained from the considered
decision table (within parentheses there are the objects supporting the
corresponding rules):

1. if $f(x, A_1)$ is high, then $x \in Cl_2^{\geq}$ (1, 6, 8, 12)

2. if $f(x, A_1)$ is at least medium and $f(x, A_4)$ is B, then $x \in Cl_2^{\geq}$ (8, 9, 11, 12)

3. if $f(x, A_1)$ is low, then $x \in Cl_1^{\leq}$ (4, 10)

4. if $f(x, A_1)$ is medium and $f(x, A_4)$ is A, then $x \in Cl_1^{\leq}$ or $x \in Cl_2^{\geq}$ (2, 3, 5, 7).

3.3 Comparison of the results 

The advantages of the rough set approach based on dominance and indiscerni- 
bility relations over the original rough set analysis, based on the indiscernibility 
relation, can be summarized in the following points. 

The results of the approximation are more satisfactory. This improvement 
is represented by a smaller reduct ($\{A_1, A_4\}$ against $\{A_1, A_2, A_3, A_4\}$). Let us 
observe that even if the quality of the approximation is lower (0.67 vs. 
0.83), this is another point in favour of the proposed approach. In fact, this 
difference is due to warehouses 5 and 7. Let us notice that with respect to 
the attributes $A_1$, $A_2$, $A_3$, which are criteria, warehouse 5 dominates warehouse 
7 and with respect to the attribute $A_4$, which is not a criterion, warehouses 
5 and 7 are indiscernible. However, warehouse 5 has a comprehensive evaluation 
worse than warehouse 7. Therefore, this can be interpreted as an inconsistency 
revealed by the approximation by dominance and indiscernibility that cannot be 
pointed out when we consider the approximation by indiscernibility only. 
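
The inconsistency between warehouses 5 and 7 can be checked mechanically. The Python sketch below uses only the two profiles that can be read off from the rules and the discussion in this section (the full decision table is not reproduced here). The numeric encodings of the ordinal scales, the restriction of the $A_2$ scale to the two values that occur in the rules, and all function names are illustrative assumptions.

```python
# Assumed ordinal encodings of the criteria (higher means better).
CAPACITY = {"low": 0, "medium": 1, "high": 2}   # A1: capacity of the sales staff
QUALITY = {"medium": 0, "good": 1}              # A2: perceived quality of goods (values seen in the rules)
TRAFFIC = {"no": 0, "yes": 1}                   # A3: high traffic location
DECISION = {"loss": 0, "profit": 1}             # ordered decision classes

# Profiles of warehouses 5 and 7 as described in the text.
w5 = {"A1": "medium", "A2": "medium", "A3": "yes", "A4": "A", "d": "loss"}
w7 = {"A1": "medium", "A2": "medium", "A3": "no", "A4": "A", "d": "profit"}


def dominates(x, y):
    """x is at least as good as y on every criterion and equal on the non-criterion A4."""
    return (CAPACITY[x["A1"]] >= CAPACITY[y["A1"]]
            and QUALITY[x["A2"]] >= QUALITY[y["A2"]]
            and TRAFFIC[x["A3"]] >= TRAFFIC[y["A3"]]
            and x["A4"] == y["A4"])


def indiscernible(x, y):
    """Classical indiscernibility: identical values on all condition attributes."""
    return all(x[a] == y[a] for a in ("A1", "A2", "A3", "A4"))


def dominance_inconsistent(x, y):
    """x dominates y but is assigned to a worse class."""
    return dominates(x, y) and DECISION[x["d"]] < DECISION[y["d"]]


print(dominance_inconsistent(w5, w7))  # inconsistency seen by the dominance-based approach
print(indiscernible(w5, w7))           # the two objects differ on A3, so the classical approach sees no conflict
```

Because the two warehouses differ on $A_3$, the indiscernibility relation places them in different elementary sets, which is why the classical analysis does not flag the pair.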

From the viewpoint of the quality of the set of decision rules extracted from 
the information table by the two approaches, let us remark that the decision rules 
obtained from the approximation by dominance and indiscernibility relations 
give a more synthetic representation of knowledge contained in the information 
table. The minimal set of decision rules obtained from the new approach has a 
smaller number of rules (4 against 6), uses a smaller number of attributes and 
descriptors than the set of the decision rules obtained from the classical rough set 
approach, and yields rules supported by a larger number of objects. Furthermore, 
let us observe that the rules obtained from the original rough sets approach 
present some problems with respect to their interpretation. E.g. rule 3 obtained 
by the original rough set approach says that if the capacity of the sales staff 
is medium, the perceived quality of goods is medium and the warehouse is 
not in a high traffic location, then the warehouse makes profit. One can expect 
that improving the quality of the warehouse, e.g. considering a warehouse with 
the same capacity of the sales staff and the same quality of goods but located 
in a high traffic location, the warehouse should also make profit. Surprisingly, 
warehouse 5 of the considered decision table has these characteristics but 
it makes loss. Finally, let us remark that rule 4 from the new approach is an 
approximate rule, as well as rule 4 from the classical approach. However, rule 4 
from the new approach is based on a smaller number of descriptors and is 
supported by a greater number of objects. 

4 Conclusion 

We presented a new rough set approach whose purpose is to approximate sets 
of objects divided into ordered predefined categories considering criteria, i.e. 
attributes with ordered domains, jointly with attributes which are not criteria. 
We showed that the basic concepts of the rough sets theory can be restored in 
the new context. We also applied the proposed methodology to an exemplary 
problem approached also with the classical rough set analysis. The comparison 
of the results proved the usefulness of the new approach. 

Abstract. A heuristic method of model choice for a nonlinear 
regression problem on the real line, based on the Equation Finder 
(EF) of Zembowicz and Zytkow (1992), is proposed and discussed. 
In our implementations of the EF we use a new, actually 
a three-stage, procedure for stabilizing model selection. First, 
a set of pseudosamples is obtained from the original sample by 
resampling in some way. Second, for each pseudosample, a family 
of acceptable models is found by a clustering-like algorithm 
performed on models with largest (adjusted) coefficients of 
determination. And third, the final selection is made from among 
the models which appear most often in the families obtained 
in the second stage. 



1 Introduction and Outline of the Model Selection 
Procedure 

Discovering equations or functional relationships from data in presence of ran- 
dom errors should be an inherent capability of every data mining and, more gen- 
erally, knowledge discovery system. Such discoveries, in statistical terms, amount 
to choosing a model in a regression set-up. That is, given a random sample of 
pairs of variables, $(x_i, y_i)$, where $y_i = f(x_i) + \varepsilon_i$, $i = 1, 2, \ldots, n$, the 
$\varepsilon_i$'s are zero-mean random variables and $f$ is an unknown function, the task is 
to estimate $f$ by a member of a given family of nonlinear functions and to assess 
validity of the model thus obtained. 

In this report, we present a new application of Equation Finder (EF), a 
discovery system of Zembowicz and Zytkow (see Zembowicz and Zytkow (1992) 
for a detailed description and discussion of the system and Moulet (1997) for the 
latter). The system finds “acceptable” models by means of a systematic search 
among polynomials of transformed data, when transformations - such as, e.g., 
logarithm, exponent and inverse - of both the independent and response variable 
are allowed. The family of all models to be considered is decided upon in advance 
by listing possible transformations of data and choosing the highest possible 
order of the polynomials. In the original version of EF, possible acceptance of 
a particular model is based on weighted least-squares (WLS) estimation of the 
model parameters and on the test of fit. 



L. Polkowski and A. Skowron (Eds.): RSCTC’98, LNAI 1424, pp. 68-74, 1998. 
© Springer-Verlag Berlin Heidelberg 1998 

Although EF uniquely determines the true model for samples from the same 
true model with sufficiently small error or large range of x values, it fails to 
provide sufficiently stable results for samples that carry larger error. As will be 
seen from simulation results in Section 3, for different samples from the same 
distribution the system provides results that may differ substantially from sample 
to sample. (Admittedly, this sort of instability is common to practically all the 
well-established methods of model choice in nonlinear regression.) It follows that 
some sort of stabilization of the original model selection procedure is needed, at 
least for regression problems with “larger errors”. 

In our implementations of the system, both the regression models and their 
parameters are provided by EF, the latter in accordance with the WLS method. 
But our choice of acceptable models (from those provided by EF) is different. It 
consists of three stages. First, a set of pseudosamples is obtained from the origi- 
nal sample by resampling combined with leave-many-out procedure: the original 
data are randomly permuted and the first pseudosample is obtained by leaving- 
out the first 10% of the data (for simplicity, we assume here that the sample size 
is divisible by 10); then the second 10% is left-out to obtain the second pseu- 
dosample, again of the size equal to 90% of the size of the original data, and so 
on, each time giving a pseudosample of the same size; once the last 10% of the 
data is left-out, the whole process, starting with permuting the original data, is 
repeated several times (e.g., if the sample size is 100, each permutation enables 
one to obtain 10 pseudosamples of size 90). Second, for each pseudosample, a 
family of acceptable models is found by a clustering-like algorithm performed on 
models with largest (adjusted) coefficients of determination. Loosely speaking, 
clustering rests on two-dimensional scaling based on distances between the mod- 
els. And third, the final selection is made from among the models which appear 
most often in the families obtained in the second stage. 
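
The first stage can be sketched in a few lines of Python. This is only an illustration of the permute-and-leave-10%-out scheme described above; the function name `pseudosamples`, its parameters and the toy data are our own assumptions and not part of EF.

```python
import random


def pseudosamples(data, n_permutations, n_blocks=10, seed=None):
    """Generate pseudosamples by random permutation followed by leave-one-block-out.

    Each permutation yields n_blocks pseudosamples, each missing one
    consecutive block of len(data) / n_blocks observations.
    """
    n = len(data)
    if n % n_blocks != 0:
        raise ValueError("sample size must be divisible by the number of blocks")
    rng = random.Random(seed)
    block = n // n_blocks
    result = []
    for _ in range(n_permutations):
        shuffled = list(data)
        rng.shuffle(shuffled)
        for b in range(n_blocks):
            start = b * block
            # Drop the b-th block, keep the remaining 90% of the data.
            result.append(shuffled[:start] + shuffled[start + block:])
    return result


# Toy sample of size 100; five permutations give 50 pseudosamples of size 90.
sample = [(i, 2 * i) for i in range(100)]
ps = pseudosamples(sample, n_permutations=5, seed=0)
print(len(ps), len(ps[0]))
```

Within one permutation the left-out blocks are disjoint and together cover the whole sample, so every observation is omitted exactly once per permutation.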

The rationale behind the method of choice of acceptable models is simple. 
Had we many independent sets of data, it would be reasonable to find a regression 
model for each set independently of the others, and then to aggregate the models 
obtained into one. Given just one set, we use it to produce many pseudosamples 
and proceed as if they were independent sets of data. In fact, our method of model 
selection borrows from stabilization ideas of Breiman (1996a) (see also Breiman 
(1996b)), with aggregation of multiple versions of a predictor^ essentially made 
by a plurality vote: the final selection is made from among the models which 
appear most often in clusters of models obtained for subsequent pseudosamples. 
It is believed, and has been confirmed by simulations, that predictors which are 
relatively close to the true model underlying the data are likely to form clusters 
in a properly defined space of models. 

In the next section, more details on the method will be given. Simulation 
results will be discussed in Sect. 3. Although we shall confine ourselves to discussing 
just one example, several more have been thoroughly investigated, leading to 
essentially the same conclusions. All in all, the simulations show that the method 
proposed can be considered a useful tool for model selection in the nonlinear 
regression context. 

^ Borrowing from machine learning terminology, we use the terms regression model 
and predictor interchangeably. 

In this report, we deal with discovering functions of one variable only, in 
accordance with the current capability of EF. In a separate report by the authors 
and J. Zytkow, possible extensions to discovering functions of several variables 
will be discussed. 

2 More on the Method 

Assume that the true model has the form 

$$Y = f(X) + \varepsilon(X), \qquad (1)$$ 

where $f$ is a regression function (i.e., $E(\varepsilon(X) \mid X) = 0$) and $\mathrm{Var}(\varepsilon(X) \mid X)$ is 
a function of $X$. Unlike Zembowicz and Zytkow (1992) we do not assume the 
variances $\mathrm{Var}(\varepsilon(X) \mid X)$ to be known (in EF only the WLS method of estimation 
is implemented and therefore the system requires that the weights for the WLS 
equation be provided). Instead, given a sample $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, we 
apply the Additive and Variance Stabilizing Transformation (Tibshirani’s (1988) 
AVAS with the Supersmoother of Friedman and Stuetzle (1982)) to obtain an 
auxiliary estimate $\tilde{f}$ of $f$ and we use 

$$w_i = [y_i - \tilde{f}(x_i)]^{-2}$$ 

as the weights in the WLS equation for computing parameters of any particular 
model provided by EF (this simple choice of weights has proved to work surpris- 
ingly well in the simulations). The EF is applied separately to each pseudosample 
generated as described in the previous section. 

For each fixed pseudosample (by abuse of notation to be denoted identi- 
cally as the original sample), the models obtained are ranked by the adjusted 
coefficient of determination 

$$R_a^2 = 1 - \frac{WSSR/(n-p)}{WSST/(n-1)},$$ 

where, as usual, $p$ is the number of parameters in the model, WSSR is the 
weighted sum of squares of residuals (with $\hat{f}$ denoting the model obtained), 

$$WSSR = \sum_{i=1}^{n} w_i \bigl(y_i - \hat{f}(x_i)\bigr)^2,$$ 

and WSST is the weighted total sum of squares, 

$$WSST = \sum_{i=1}^{n} w_i (y_i - \bar{y})^2,$$ 

with $\bar{y} = n^{-1} \sum_{i=1}^{n} y_i$. Now, a fraction of models with largest $R_a^2$ is chosen for 
later investigation; either the fraction value can be determined a priori or a 
threshold value of $R_a^2$ can be predetermined or the threshold value can be found 
by inspection: as a rule, “large” values of $R_a^2$ appear in a clear-cut cluster. In 
turn, the matrix of “empirical squared distances” between all pairs of the models 
chosen is formed, with the “empirical squared distance” between models $f_j$ and 
$f_k$ defined as 

$$\sum_{i=1}^{n} \bigl(f_j(x_i) - f_k(x_i)\bigr)^2.$$ 

In this way, a matrix of dissimilarities between the models is formed, which is 
then used to represent the models in a two-dimensional space via classical met- 
ric scaling (see, e.g., Krzanowski (1988); in our computations, S-Plus’s cmdscale 
function is used). The “most outlying” models in the 2D-representation obtained 
are removed and a new representation of the remaining models is obtained, again 
by classical metric scaling. This process of removing the “most outlying” models 
and building a 2D-representation of the models still in the family is performed 
repeatedly. In our implementations of such a clustering-like algorithm, we pro- 
ceeded in one of two ways. One was automatic: the number of repetitions of the 
process was determined in advance and, at each repetition, a fixed number of 
models was retained for further analysis. For instance, for n = 100, the process 
was repeated 3 times, with the numbers of models retained equal, respectively, to 
20, 15 and 10. The most outlying models were defined as those farthest from the 
models’ center of gravity in the 2D-representation. Surprisingly, despite obvious 
drawbacks of this criterion, it proved to work generally well in the simulations. 
The other way was subjective: the outlying models were found by inspection, 
and the process of removing models and re-representing them in a 2D-space 
was repeated until no obviously “outlying models” could be found (usually, the 
process was again repeated thrice). Subjective detection of “outliers” was facil- 
itated by constructing also a dendrogram of models, again based on the same 
dissimilarities between the models (S-Plus’s hclust function for complete linkage 
clustering was used). 
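
The ingredients of this stage can be sketched in plain Python. The sketch below is not the authors' S-Plus code: it assumes the usual form of the adjusted coefficient of determination, $R_a^2 = 1 - [WSSR/(n-p)]/[WSST/(n-1)]$, an unnormalized sum for the empirical squared distance, and it takes the 2D coordinates as given (in the paper they come from classical metric scaling, which is not reimplemented here). All function names are hypothetical.

```python
import math


def adjusted_r2(y, fitted, weights, p):
    """Weighted adjusted R^2 for a model with p parameters."""
    n = len(y)
    y_bar = sum(y) / n
    wssr = sum(w * (yi - fi) ** 2 for w, yi, fi in zip(weights, y, fitted))
    wsst = sum(w * (yi - y_bar) ** 2 for w, yi in zip(weights, y))
    return 1.0 - (wssr / (n - p)) / (wsst / (n - 1))


def squared_distance(fj_values, fk_values):
    """Empirical squared distance between two models evaluated at the sample points."""
    return sum((a - b) ** 2 for a, b in zip(fj_values, fk_values))


def dissimilarity_matrix(fitted_by_model):
    """Matrix of empirical squared distances between all pairs of models."""
    ids = list(fitted_by_model)
    return [[squared_distance(fitted_by_model[a], fitted_by_model[b]) for b in ids]
            for a in ids]


def retain_closest(coords, keep):
    """Keep the `keep` models closest to the centre of gravity of a 2D representation.

    coords maps a model identifier to its (u, v) coordinates; the models left out
    are the "most outlying" ones in the sense of the automatic criterion.
    """
    cu = sum(u for u, _ in coords.values()) / len(coords)
    cv = sum(v for _, v in coords.values()) / len(coords)
    ranked = sorted(coords, key=lambda m: math.hypot(coords[m][0] - cu, coords[m][1] - cv))
    return ranked[:keep]


# Toy usage with made-up numbers.
y = [1.0, 2.0, 3.0, 4.0]
print(adjusted_r2(y, fitted=[1.0, 2.0, 3.0, 4.0], weights=[1.0] * 4, p=2))
print(retain_closest({"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (10.0, 10.0)}, keep=2))
```

In the automatic variant described above, `retain_closest` would be called repeatedly (for example with `keep` equal to 20, 15 and 10), with the 2D representation recomputed from the dissimilarities of the retained models before each call.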

Equipped with the outcomes of applying the clustering-like algorithm to all 
the pseudosamples, the final selection of acceptable models can be made from 
among the models which appear most often in the families obtained for the 
pseudosamples. That is, each model is ranked according to the number of pseu- 
dosamples for which the model has been retained after applying the clustering- 
like algorithm and the final selection is made from among models with highest 
ranks (the given way of ranking is referred to as the plurality vote). 
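
The plurality vote itself amounts to counting, for each model, the pseudosamples in which it survived the clustering-like algorithm. A minimal Python sketch follows; the function name and the toy families are illustrative assumptions (the model numbers are used only as labels).

```python
from collections import Counter


def plurality_vote(families):
    """Rank models by the number of pseudosamples whose family retained them.

    families: one collection of retained model identifiers per pseudosample.
    Returns (model, votes) pairs sorted by decreasing number of votes.
    """
    votes = Counter()
    for family in families:
        votes.update(set(family))  # a model gets at most one vote per pseudosample
    return votes.most_common()


# Toy example: four pseudosamples and the models retained for each of them.
families = [[158, 464, 64], [158, 113], [158, 464], [262, 464]]
ranking = plurality_vote(families)
print(ranking)
```

Ties are common in practice, so the final selection is made from among the top-ranked models rather than by taking a single winner.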

3 Simulation Results 

In our example, the regression function was of the form 

$$f(x) = 1 + \frac{[\ldots]}{x},$$ 

$x \in [1, 10]$, and the $\varepsilon$’s were normally distributed with homogeneous variance, 
$\mathrm{Var}(\varepsilon(X) \mid X) = 0.02$. Equation Finder was set at search depth equal to one, 
with transformations SQR, SQRT, EXP, LOG, INV, MULTIPLICATION and 
DIVISION, and with maximum polynomial degree equal to three. Fifty i.i.d. 
random samples $(X_1, Y_1), (X_2, Y_2), \ldots, (X_{100}, Y_{100})$ of size 100 were generated 
from model (1). 

The EF was first used on all 50 samples and stability of solutions obtained 
was assessed. For each sample, models provided by EF were ranked by their 
WSSR’s and the best 40 were subjected to two additional rankings, one based 
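on the empirical mean squared error, 

$$EMSE = \frac{1}{n} \sum_{i=1}^{n} \bigl(\hat{f}(x_i) - f(x_i)\bigr)^2,$$ 

and another on the mean squared residual error, 

$$MSRE = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - \hat{f}(x_i)\bigr)^2,$$ 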

that is, the 40 models were ranked by their EMSE’s and MSRE’s. Then, only 
the models which appeared for all 50 or for 49 samples were retained for stability 
assessment. For each such model, two histograms were constructed: one for the 
model’s EMSE ranks in all samples and another for the model’s MSRE ranks in 
all samples. 
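
The two criteria differ only in what the fitted model is compared with: the true regression function (available in simulations only) or the observed responses. A small Python sketch of both, together with the ranking step, is given below; the function names and the toy data are our assumptions, and the true function used in the usage lines is a placeholder rather than the one from the simulation study.

```python
def emse(model, true_f, xs):
    """Empirical mean squared error: fitted model against the true regression function."""
    return sum((model(x) - true_f(x)) ** 2 for x in xs) / len(xs)


def msre(model, xs, ys):
    """Mean squared residual error: fitted model against the observed responses."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def rank_models(scores):
    """Map each model identifier to its rank (1 = smallest score)."""
    ordered = sorted(scores, key=scores.get)
    return {model_id: rank for rank, model_id in enumerate(ordered, start=1)}


# Toy usage: placeholder true function, noisy observations, two candidate models.
xs = [1.0, 2.0, 3.0, 4.0]
true_f = lambda x: 2.0 * x
ys = [2.1, 3.9, 6.2, 7.8]
models = {"A": lambda x: 2.0 * x, "B": lambda x: 2.0 * x + 1.0}

emse_ranks = rank_models({m: emse(f, true_f, xs) for m, f in models.items()})
msre_ranks = rank_models({m: msre(f, xs, ys) for m, f in models.items()})
print(emse_ranks, msre_ranks)
```

Repeating this over many samples and collecting each model's rank gives the data from which the two histograms per model are built.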

In Fig. 1, histograms for EF models: No. 64 ($f(x) = x(a + b/x + c/x^2)$), 
No. 113 ($f(x) = a + b/x$), No. 158 ($f(x) = (a + b\sqrt{x} + cx)/x$), No. 262 
($f(x) = x/(a + b\log x + c\log^2 x + d\log^3 x)$), No. 359 ($f(x) = \log(a + \ldots + cx + \ldots)$) 
and No. 464 ($f(x) = a + b/x + c/x^2$) are given for the sake of illustration (for 
each model, the two histograms are depicted together, with the adjacent bars 
belonging alternately to the first and the second histogram: in the leftmost pair 
of bars, ranks 1-3 according to EMSE and to MSRE, respectively, are grouped, 
in the second leftmost pair, ranks 4-6, etc.). We note that models 64, 113, 158 
and 464 are all equal to $f$ provided their parameters assume suitable values. 

It is apparent from all the histograms obtained that, whatever the form of a 
particular model, the MSRE ranks hardly follow those based on EMSE. While 
the EMSE rankings for models 64, 113, 158 and 464 point to the possibility of 
dealing with data from /, it is by far not the case for the MSRE rankings. Rather, 
by the “MSRE standards” it is model 262 which should be considered most 
likely to be the true one. Generally, these are strong indications that not only 
the solutions are unstable in the given sense but we cannot rely unconditionally 
on the LS methodology when choosing a model (we do not have to refer here 
specifically to the WLS methodology since the error variance is homogeneous in 
the example under scrutiny). On the other hand, the plurality vote performed 
after applying the (automatic) clustering-like algorithm to all the 50 independent 
cases ranked model 158 as the first one, model 464 as the second, model 64 as 
the fourth and model 113 as the seventh one. 

Fig. 1. Histograms of EMSE and MSRE for six example models 

In reality, of course, we are most often given just one sample. Table 1 provides 
a summary of results obtained for three example samples (actually, samples No. 
21, 25 and 36), when each sample was used to generate 50 pseudosamples (as 
described in Sect. 1) and the automatic clustering-like algorithm from Sect. 2 
was applied to the pseudosamples. In the first three columns, results of rankings 
based on the EMSE, MSRE and WSSR are given for the four models mentioned 
above. The models’ ranks which resulted from the plurality vote are given in the 
last column along with the number of pseudosamples for which the model was 
retained by the clustering-like algorithm (the number given in brackets). Judged 
by the concordance/discordance between the EMSE and MSRE, samples 21 and 
36 can be considered a medium-hard basis for estimating the regression function 
sought, while sample 25 can be considered a hard one. For sample 21, the best 
model according to the plurality vote scored 36 votes, and the subsequent scores 
were: 33, 32, 28, 27, 27, 26, 22 six times, etc. For sample 25, the best model 
scored 28 votes and the subsequent scores were: 27, 26 twice, 25 twice, 23, 22 
four times, etc. For sample 36, the three best models scored 49 votes and the 
subsequent votes were: 47 twice, 44, 41, 35, 31, 28, 20, 16, 14, 5, etc. Results 
reported for the three samples are rather typical when compared with those for 
other samples in their categories. 



As the results summarized in the table are rather typical for other samples 
(and other examples studied), it is clearly seen that the stabilizing method can 
be used to our advantage. It is worth noting that the fit, used in the original 
version of EF, is essentially based on the WSSR. 

Table 1. 

             sample No. 21                       sample No. 25 
Model  EMSE  MSRE  WSSR  After        Model  EMSE  MSRE  WSSR  After 
No.    rank  rank  rank  vote         No.    rank  rank  rank  vote 
464    2     16    16    8 (22)       158    6     30    32    8 (22) 
113    3     22    21    8 (22)       113    7     28    33    12 (21) 
64     4     21    20    8 (22)       64     8     27    31    8 (22) 
158    5     17    18    8 (22)       464    9     30    34    8 (22) 

             sample No. 36 
Model  EMSE  MSRE  WSSR  After 
No.    rank  rank  rank  vote 
113    1     22    19    9 (31) 
464    …     21    22    4 (47) 
158    …     20    21    … 
64     7     19    20    1 (49) 

A final choice of reliable models from those with the best scores (obtained 
from the plurality vote) can be done by a separate analysis. In particular, after 
possible negligibility of some components of the models under scrutiny has been 
taken into account, such an analysis can be based on the distances between those 
models, their plausibility in view of some other information, etc. 

Abstract. A statistical infrastructure based on the concentration measures 
is presented. It aims at being a consistent system of descriptive 
parameters with a clear interpretation, summarizing knowledge on 
distributions of variables in populations of objects. The system requires 
unified formulas for sets of mixed continuous-categorical variables. These 
demands are met e.g. by the formulas given in Sec. 3 for the bivariate 
dependence measures, presented as suitably weighted averages of 
concentration indices for pairs of conditional distributions. Further, in Sec. 4, 
a concentration oriented modification of the statistical procedure called 
correspondence analysis is mentioned and exemplified by applying it to 
data from the Polish Parliament Elections in 1993 and 1997. Sec. 5 
contains remarks on creating an inference and computing system adjusted 
to this new concentration-based approach to statistical descriptive 
parameters. 



1 Introduction 

Statistical infrastructure appearing in the title of this paper refers to 
descriptive parameters which summarize our knowledge about distributions of sets of 
variables in populations of objects. These descriptive parameters are then 
exploited in various decision making schemes. It is well known that very simple 
data models, the multinormal model in particular, can be easily described and 
usually lead to easy and consistent patterns of inference and computing. However, 
oversimplified models are not helpful in solving real world problems. Many 
efforts have been made to construct models more complicated than the multinormal 
one but retaining a similar simple (“linear”) inference and computing pattern. The 
most popular among them is the log-linear model applicable to sets of ordinal 
and nominal categorical variables (and thus also to discretized continuous 
ones) (cf. [1]). The log-linear inference and computing system is very similar to 
that in the multinormal case. Computing is well-organized; it enables dealing 
simultaneously with many inference problems. This is achieved due to a special 
log-linear infrastructure which aims at transforming sets of categorical variables 
onto “linear-like” models. But neither descriptive parameters nor inference have 
a convincing interpretation and justification. 

Some recently developed computational systems refer to rather narrowly 
designed specific problems and admit specific families of decision rules (see e.g. [8], 
a comparative study of exploratory methods related to the DEDICOM system, 
meaning DEcomposition into Directional COMponents). Various approaches to 
a data set may provide helpful complementary information but may also lead to 
a total mess. 

Let us stress that it is the computing system which forms an interface 
between a specific inference system and its users. Such a system plays the role of 
a specific language. Users often rely too strongly on loose interpretation of the 
results provided by the computing system such as “significant”, “strongly 
dependent”, “redundant”, “outlier”, etc. They become accustomed to a particular 
computing system and are reluctant to make a change. This is why new ideas are 
not easily conveyed to practitioners. To have any chance, new ideas should be 
incorporated into a ready-made computing system. In the sequel we describe a 
new computing trend in data analysis which is now in statu nascendi. This new 
trend is induced by statistical infrastructure based on concentration measures. 

It is astonishing that statistical infrastructure based on concentration 
measures is being developed so slowly although basic ideas were formulated by Gini, 
Kakwani and Fogelson as early as the beginning of the 20th century (cf. e.g. 
[12]). Obviously, we will not even try to trace here the main publications in this 
area. Our own contribution has been achieved by a team (consisting of the authors 
of this paper together with T. Bromek, W. Szczesny, M. Niewiadomska-Bugaj 
and W. Wysocki). The team’s results can be found in the bibliography at the 
end of the present paper. The important starting points were: [5], [10], [11], [13], 
[21], [23]. We aim at constructing a consistent system of descriptive tools based 
on concentration measures, defined in a unified way for mixed data (i.e. for sets 
of variables measured on various measurement scales). We tried to show how 
this system of descriptive tools is linked with important inference problems and 
how this worked in a number of case studies [2], [8], [9], [10], [19], [22], [24]. 

In the present paper we restrict ourselves to sketching the descriptive tools used 
to compare one probability distribution with another and to describe bivariate 
dependence. This allows us to indicate our main inference and computing 
procedures and the corresponding visualization tools. A few illustrations are made 
using data which concern the last two elections (1993 and 1997) to the Polish 
Parliament. 

2 A look at concentration measures 

The vague idea of concentration is a fundamental informal statistical notion. 
Formal definitions of concentration measures serve to create the basic structure 
of important descriptive tools. 

Let us start with two categorical variables $X$ and $Y$ valued $1, \ldots, k$ with 
probabilities $p_1, \ldots, p_k$ and $q_1, \ldots, q_k$, respectively. The concentration curve of $Y$ 
w.r.t. $X$ related to the natural order of points in $\{1, \ldots, k\}$ consists of segments 
which join points $(0, 0)$, $(p_1, q_1)$, $(p_1 + p_2, q_1 + q_2)$, $\ldots$, $(1, 1)$. To illustrate, let $X$ 
and $Y$ be two nominal categorical variables with probabilities equal to fractions 
of votes gained by 10 political parties in the Polish Parliament Elections in 1993 
(for $X$) and in 1997 (for $Y$) (cf. [24]). To be precise, there are 7 parties and 
3 “quasi parties” corresponding respectively to non-voters, those who cast 
formally invalid votes, and those who voted in 1993 for a party which vanished 
in 1997 (in case of $X$) or vice versa (in case of $Y$). Suppose that the parties and 
quasi parties are labeled so that the likelihood ratios $q_i/p_i$ are increasing (i.e. 
the parties are ordered from that which had the maximal loss to that with the 
maximal gain). The respective concentration curve is shown in Fig. 1. It describes 
changes in relative concentration of vote fractions. It is convex and consists of 
six longer segments (1, 2, 6, 7, 8, 9) and four very short ones corresponding to 
less popular parties. Number 7 refers to non-voters. Comparing the slopes of 
particular segments with the diagonal $y = x$ we see that the fraction of 
non-voters slightly increased in 1997, parties 8 and 9 also belong to “winners” with 
increased fractions of voters, the fraction corresponding to party 6 is practically 
unchanged, and parties 1-5 are “losers”. 




Fig. 1. The absolute concentration curve of the distribution of votes in the 1997 
election relative to that in the 1993 election. 



For nominal variables the ordering related to increasing likelihood ratios 
seems to be the only one worth consideration. For ordinal variables this may 
not be the case. For instance, let $X$ and $Y$ be two income distributions for $k$ ordered 
income categories selected a priori. Then one may be interested in the 
concentration curve related to the order $1, \ldots, k$ which may be different from that 
corresponding to increasing likelihood ratios. Thus we may draw both curves; 
the second one will evidently lie under the first. 

For continuous variables $X$ and $Y$ with distribution functions $F$ and $G$ the 
situation is similar; the ratios $q_i/p_i$ are replaced by the ratio of densities 
$h(x) = dG(x)/dF(x)$. The concentration curve corresponding to $h$ will be called the 
absolute concentration curve (cf. [3]) and denoted $C_{\max}(Y : X)$. It lies under the 
directed concentration curve $C(Y : X; \varphi)$ for any $\varphi : \Omega \to \mathbb{R}$ which orders the 
set of elementary events on which $X$ and $Y$ are defined. 

In the general case, distributions of $X$ and $Y$ may be replaced by any 
probability measures $P$ and $Q$ defined on $(\Omega, \mathcal{A})$. Curves $C(Y : X; \varphi)$ and $C_{\max}(Y : X)$ 
are function-valued measures of differentiation between $P$ and $Q$. This 
differentiation is reflected by the position of a particular curve with respect to the diagonal 
$y = x$ which corresponds to equal $P$ and $Q$. Departures of a concentration curve 
from the diagonal, positive under $y = x$, negative over it, serve to construct 
summary measures of differentiation between $P$ and $Q$. Thus we have 

$$ar(Y : X; \varphi) = 2 \int_0^1 \bigl(t - C(Y : X; \varphi)(t)\bigr)\,dt,$$ 

$$ar_{\max}(Y : X) = 2 \int_0^1 \bigl(t - C_{\max}(Y : X)(t)\bigr)\,dt.$$ 

If $X$ and $Y$ are at least ordinal and $\varphi(x) = x$, we omit $\varphi$ and write $ar(Y : X)$. 
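
For categorical variables the concentration curve is piecewise linear, so the index can be computed exactly by the trapezoid rule. The Python sketch below assumes that the index is twice the signed area between the diagonal and the curve (this normalization is our assumption, chosen so that the index lies in $[-1, 1]$); the function names are hypothetical and the probability vectors are made up.

```python
def concentration_index(p, q, order=None):
    """ar(Y : X) for categorical distributions p (of X) and q (of Y).

    The curve joins (0, 0), (p_1, q_1), (p_1 + p_2, q_1 + q_2), ..., (1, 1),
    with the categories taken in the given order (natural order by default).
    """
    idx = list(range(len(p))) if order is None else list(order)
    area_under_curve = 0.0
    cum_q = 0.0
    for i in idx:
        # Trapezoid of width p_i with heights cum_q and cum_q + q_i.
        area_under_curve += p[i] * (cum_q + q[i] / 2.0)
        cum_q += q[i]
    return 2.0 * (0.5 - area_under_curve)


def absolute_concentration_index(p, q):
    """ar_max(Y : X): categories ordered by increasing likelihood ratio q_i / p_i."""
    def ratio(i):
        return q[i] / p[i] if p[i] > 0 else float("inf")
    order = sorted(range(len(p)), key=ratio)
    return concentration_index(p, q, order)


# Toy usage with made-up probabilities.
p = [0.5, 0.5]
q = [0.8, 0.2]
print(concentration_index(p, q), absolute_concentration_index(p, q))
```

Ordering by likelihood ratios makes the curve convex, which is why the absolute index is never smaller than the index obtained for any other ordering of the same categories.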

3 Dependence measures based on concentration indices 

Let us start with a pair of categorical variables $(X, Y)$ with probabilities $p_{ij}$, 
$i = 1, \ldots, m$, $j = 1, \ldots, k$. Then, the vectors defining the marginal distributions 
of $X$ and $Y$ are $\{p_{i+} = \sum_{j=1}^{k} p_{ij},\ i = 1, \ldots, m\}$ and $\{p_{+j} = \sum_{i=1}^{m} p_{ij},\ j = 1, \ldots, k\}$, 
and the conditional distributions of $Y|X = i$ and of $X|Y = j$ are: 

$$P_i = P_{Y|X=i} = (p_{i1}/p_{i+}, \ldots, p_{ik}/p_{i+}), \quad i = 1, \ldots, m,$$ 

$$Q_j = P_{X|Y=j} = (p_{1j}/p_{+j}, \ldots, p_{mj}/p_{+j}), \quad j = 1, \ldots, k.$$ 
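
In code, the marginal and conditional distributions are obtained from the two-way table by row and column normalization. The following Python sketch is a plain illustration with hypothetical function names and a made-up table; it assumes that all marginal probabilities are positive.

```python
def marginals(table):
    """Return (p_{i+}) and (p_{+j}) for a two-way table of joint probabilities."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    return row, col


def conditionals(table):
    """Return the lists (P_i) and (Q_j) of conditional distributions.

    P[i] is the distribution of Y given X = i (row i divided by p_{i+});
    Q[j] is the distribution of X given Y = j (column j divided by p_{+j}).
    """
    row, col = marginals(table)
    P = [[pij / row[i] for pij in r] for i, r in enumerate(table)]
    Q = [[table[i][j] / col[j] for i in range(len(table))] for j in range(len(col))]
    return P, Q


# Toy 2 x 2 table of joint probabilities (made-up numbers).
table = [[0.1, 0.2],
         [0.3, 0.4]]
row, col = marginals(table)
P, Q = conditionals(table)
print(row, col)
print(P)
print(Q)
```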

Variables $X$ and $Y$ are independent if and only if all conditional distributions 
(of $Y$ on $X$ and of $X$ on $Y$) are equal. If one tries to ground the notion 
of positive dependence on the notion of concentration then, the further below 
the diagonal $y = x$ are the concentration curves of $Y|X = x_2$ w.r.t. $Y|X = x_1$ 
for any $x_1 < x_2$, the stronger should be the positive dependence of $Y$ on $X$. 
This introduces a suitable ordering in the set of pairs $(X, Y)$ related to positive 
dependence. A global measure of this dependence could be based on the 
concentration indices of $Y|X = x_2$ w.r.t. $Y|X = x_1$ directed according to $Y$. 
Therefore, for $X$ valued $1, \ldots, m$ and $Y$ valued $1, \ldots, k$, we introduce summary 
measures of positive dependence of $Y$ on $X$ in the form of linear combinations 
of $ar(P_t : P_s)$, $s < t$, $s, t = 1, \ldots, m$. The leading role of concentration measures 
in creating statistical descriptive tools is illustrated by the fact that (cf. [15]) a 
popular measure of positive dependence, known as Spearman’s rho and denoted 
$\rho^*$, is representable in this form. Due to generality of the definition of the 
concentration index, the definition of $\rho^*$ is immediately generalized to the whole 
family of pairs of continuous or categorical-continuous variables. 

To mention another link between $\rho^*$ and $ar$ (cf. e.g. [11]), we note that the 
distribution of any pair $(X, Y)$ may be mapped, by means of the so called 
grade transformations performed simultaneously on each marginal distribution, 
into a continuous distribution (with uniform marginals) on the unit square. 
Let $(X^*, Y^*)$ denote the respective “grade representation” of $(X, Y)$. We have 
$\rho^*(X, Y) = \rho^*(X^*, Y^*) = \mathrm{cor}(X^*, Y^*)$, and $\rho^*$ is therefore called the grade 
correlation between $X$ and $Y$. What is essential here, $\rho^*(X, Y) = 3\,ar(r(X^*) : X^*)$ 
where $r$ is the regression function of $Y^*$ on $X^*$. 
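
The factor 3 can be verified by a short calculation. The sketch below rests on two assumptions that are our reading rather than statements of the text: the concentration curve of $r(X^*)$ w.r.t. $X^*$ is the normalized cumulative integral $C(t) = \int_0^t r(u)\,du \big/ \int_0^1 r(u)\,du$, and $ar$ is twice the area between the diagonal and the curve. Since $X^*$ and $Y^*$ are uniform on $[0, 1]$, we have $\int_0^1 r(u)\,du = E\,Y^* = 1/2$ and $\mathrm{Var}\,X^* = \mathrm{Var}\,Y^* = 1/12$.

```latex
\begin{align*}
C(t) &= 2\int_0^t r(u)\,du, \\
ar(r(X^*) : X^*) &= 2\int_0^1 \bigl(t - C(t)\bigr)\,dt
  = 1 - 4\int_0^1 (1-u)\,r(u)\,du
  = 4\int_0^1 u\,r(u)\,du - 1, \\
\rho^*(X, Y) &= \operatorname{cor}(X^*, Y^*)
  = 12\Bigl(\int_0^1 u\,r(u)\,du - \tfrac{1}{4}\Bigr)
  = 3\,ar(r(X^*) : X^*).
\end{align*}
```

The second line uses the exchange of the order of integration, $\int_0^1\!\int_0^t r(u)\,du\,dt = \int_0^1 (1-u)\,r(u)\,du$, and the third uses $E(X^* Y^*) = \int_0^1 u\,r(u)\,du$.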

4 A look at the grade correspondence-cluster analysis 

We recall that for any $X, Y : \Omega \to \mathbb{R}$ the concentration index $ar(Y : X)$ is 
accompanied by the absolute concentration index $ar_{\max}$ which is the maximum 
value of $ar(Y : X; \varphi)$ over all possible orderings $\varphi$ in $\Omega$. Similarly, for a pair 
$(X, Y) : \Omega_1 \times \Omega_2 \to \mathbb{R}^2$, $\rho^*(X, Y)$ is accompanied by $\rho^*_{\max}(X, Y)$ which is the 
maximum value of $\rho^*(f(X), g(Y))$ over all possible pairs of orderings in $\Omega_1$ and 
$\Omega_2$ introduced respectively by $f$ and $g$. Transformations $f^*$, $g^*$ of $X$ and $Y$ 
which achieve the maximum value of $\rho^*$ provide a “best” approximation to 
positive dependence. This resembles an important statistical procedure called 
the correspondence analysis which is based on transformations maximizing 
Pearson’s correlation coefficient. Our procedure has been called in [5] the grade 
correspondence analysis (GCA). According to Sec. 3, $\rho^*_{\max}$ is the concentration 
index of the regression function of $g^*(Y^*)$ on $f^*(X^*)$, multiplied by 3. 

GCA proved to be a very useful exploratory procedure, well suited to convenient
visualization ([7], [9], [22]). To illustrate, we will use once more the
Parliament elections data. Let $\{p_{ij},\ i = 1, \ldots, 52,\ j = 1, \ldots, 10\}$
be the two-way table with rows corresponding to election regions (voivodships)
and columns corresponding to the parties and quasi parties, where $p_{ij}$ is
the fraction of votes gained in region i by party j in the 1993 election. A
similar table refers to the 1997 election. Each table was first considered
separately, but a more interesting result was obtained when the two tables were
combined into one, still containing 52 rows (regions) but 20 columns (10
parties, each appearing twice, once in 1993 and once in 1997; parties
corresponding to 1997 are denoted 1, 2, ..., 10 as introduced in Sec. 2, their
counterparts in 1993 are denoted 11, 12, ..., 20, respectively). The combined
52 x 20 table has been transformed by the GCA procedure into a table
$\{\pi_{ij}\}$ with suitably permuted rows and columns. This allows tracing and
interpreting the latent traits which implied the orderings of voivodships and
parties obtained by GCA. We will analyse this latent structure by looking at
the results of GCA visualized in Fig. 2.

Table $\{\pi_{ij}\}$ is mapped in Fig. 2 into a unit square with 52 x 20
rectangles. The width and length of rectangle (i, j) are $\pi_{i+}$ and
$\pi_{+j}$. The product $\pi_{i+}\pi_{+j}$ is equal to the fraction of votes
expected under "fair representation", related to the fraction of votes ascribed
to region i and the total gain of votes by party j in the first or second
election. The quotient $\pi_{ij} / (\pi_{i+}\pi_{+j})$ is the overrepresentation
index. Four thresholds of this index have been chosen: 2/3, 99/100, 100/99, 3/2.
The respective intervals of the index were called: strong underrepresentation,
weak underrepresentation, almost fair representation, weak overrepresentation,
strong overrepresentation. Strong and weak underrepresentation were marked
white and light grey, weak and strong overrepresentation were marked dark grey
and black, almost fairness was marked as white with vertical lines. The GCA
transformation forced a concentration of black and dark grey rectangles in the
upper left and lower right corners of the unit square and along the diagonal which
joins these corners. This concentration indicates the trend of overrepresentation
as a relation between parties and election regions.
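
The overrepresentation index and its five classes can be computed directly from a table of counts. The sketch below is only an illustration: it assumes integer vote counts as input, assigns an index lying exactly on a threshold to the upper interval (the text does not say which side is closed), and uses hypothetical names (`overrepresentation_indices`, `classify`).

```python
from bisect import bisect_right
from fractions import Fraction

# Thresholds quoted in the text, in increasing order.
THRESHOLDS = [Fraction(2, 3), Fraction(99, 100), Fraction(100, 99), Fraction(3, 2)]
LABELS = [
    "strong underrepresentation",
    "weak underrepresentation",
    "almost fair representation",
    "weak overrepresentation",
    "strong overrepresentation",
]


def overrepresentation_indices(table):
    """Return pi_ij / (pi_i+ * pi_+j) for a table of non-negative integer counts."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    # (cell/total) / ((row/total) * (col/total)) == cell * total / (row * col)
    return [
        [Fraction(cell * total, row_sums[i] * col_sums[j]) for j, cell in enumerate(row)]
        for i, row in enumerate(table)
    ]


def classify(index):
    """Map an index to one of the five classes (boundary goes to the upper class)."""
    return LABELS[bisect_right(THRESHOLDS, index)]
```

Exact fractions are used so that the narrow band between 99/100 and 100/99 is not affected by rounding.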




Fig. 2. A graphical presentation of the table containing votes fractions gained 
by 10 political parties in 1993 and 1997 elections to the Polish Parliament in 52 
voivodships. Rows and columns are transformed due to the grade correspondence 
analysis. 



The map in Fig. 2 provides much information. It is clear that the pairs of
columns which refer to the same party or quasi party in both elections tend to
stay together: 6 with 16, 9 with 19, 8 with 18, 7 with 17, 1 with 11. The
conditional distributions of columns referring to the same party are rather
similar (the 1993 and 1997 columns of party 8 are even amazingly similar!). The
voivodships are ordered starting from such big towns in western and central
Poland as Warsaw(1), Gdańsk(11), Kraków(21), Poznań(35), Łódź(27), Wrocław(50),
through Katowice(16), Gliwice(17), Bydgoszcz(6), to smaller rural districts in
the eastern part of Poland. This seems to match the ordering of parties, from
the most liberal ones (6) to party 1 representing mainly the inhabitants of
rural areas. Two very large columns 17 and 7 refer to non-voters; it is seen
that the percentage of non-voters tends to concentrate in the rural voivodships,
especially in 1997.

Usually, the GCA should be followed by a simultaneous clustering of rows 
and columns. The idea of clustering following GCA is to divide rows as well 
as columns into non-overlapping clusters which will retain positive dependence
after aggregation; the aggregated table transformed by the GCA will remain 
unchanged. These problems are tackled in the papers [4], [5], [16], [20] but no 
general optimal solutions are yet available. 

5 Closing remarks 

The previous sections gave only a superficial insight into the construction of
descriptive parameters and inference procedures induced by the concentration
measures. In particular, we did not mention the computer-intensive tests
introduced to provide a suitable reference distribution for a considered
descriptive parameter (cf. [7], [9], [10]). This is especially important in the
case of $\rho^*_{\max}$, since it is necessary to check whether the GCA
transformations and trends are meaningful. It is also possible to introduce a
computer-intensive estimation, using a sufficiently regular one-parameter model
as reference (cf. [2]).

It follows that there are three directions of computing which have to be
considered. The main one concerns procedures related to GCA and inference
based on it (e.g. variable selection, outlier detection, discrimination). The
two other directions concern visualization and computer-intensive testing and
estimation.
[ Abstract.] Networks of parallel language processors form a
special type of grammar systems where derivation steps in the
mode of L systems and communication steps alternate. We
prove that, given an arbitrary system, an equivalent system
with at most three components can be constructed. This
improves a result of [2]. Furthermore we consider the function
f(m) which gives the number of words generated in m steps.
We relate this function to growth functions of D0L systems
and prove the undecidability of the equivalence with respect to
this function.



1 Introduction and Definitions 

Grammar systems have been intensively studied in the last decade (see [1], [4]). 
Essentially, there are two variants of grammar systems. The components of the 
system rewrite a common sentential form, or the components have their own 
sentential forms and some communication steps are performed. 

In [2] grammar systems of the second type called networks of parallel language 
processors (NLP for short) have been introduced. The process of generation con- 
sists of derivation and communication steps which alternate. The derivation step 
is a parallel rewriting of all letters of the current sentential form as in Linden- 
mayer systems. In a communication step, any component sends its sentential 
forms which satisfy its exit condition to any other component, and any com- 
ponent receives only those words which satisfy its entrance condition and adds 
these words to its sentential forms. 

In [2] it is shown that the corresponding families of generated languages 
coincide with some well-known families of languages generated by ET0L systems
with a control of the application of tables. We shall improve these results for some 
control mechanisms. We show that systems with three components are sufficient 
to generate any language which can be generated by such systems (with an 
unbounded number of components). 

Further, in [2] some results are given on the cardinality of the multiset of
words generated by NLP systems in a certain number of steps. We consider the
function f(m) which gives the number of words generated in m steps. We relate
this function to growth functions of D0L systems and prove that it is undecidable
whether or not the functions of two systems are equal.
We now give the formal definitions and some notation. Throughout the paper
we assume that the reader is familiar with the basic notions of the theory of
formal languages, especially of L languages (see [5], [6]).

By l(w) and |M| we denote the length of a word w and the cardinality of the
set M, respectively.

A conditional ET0L system (with n tables, $n \ge 1$) is a construct

$G = (V, V', \rho_1 : P_1, \rho_2 : P_2, \ldots, \rho_n : P_n, w)$

where

- $(V, V', P_1, P_2, \ldots, P_n, w)$ is a usual ET0L system with the alphabet V,
the set $V' \subseteq V$ of terminals, the production sets $P_1, P_2, \ldots, P_n$
and the axiom w, and

- $\rho_1, \rho_2, \ldots, \rho_n$ are mappings from $V^*$ into {true, false}.

The language generated by a conditional ET0L system as above consists of all
words $z \in (V')^*$ such that there is a derivation

$w = w_0 \Rightarrow_{P_{i_1}} w_1 \Rightarrow_{P_{i_2}} w_2 \Rightarrow_{P_{i_3}} \cdots \Rightarrow_{P_{i_m}} w_m = z$

for some $m \ge 0$ and $\rho_{i_j}(w_{j-1}) = true$ for $1 \le j \le m$.

Mostly we are only interested in some special conditions. 

We say that a condition $\rho$ is a regular context condition if there is a
regular language R such that $\rho(w) = true$ holds if and only if $w \in R$.

We say that a condition $\rho$ is a random context condition if there are finite
sets Q and R such that $\rho(w) = true$ holds if and only if any letter of Q
occurs in w and no letter of R occurs in w.

We say that a condition $\rho$ is a forbidden context condition if it is a random
context condition where the set Q of required letters is empty.

We shall write $\rho = R$ and $\rho = (Q, R)$ in case of a regular and random
context condition, respectively.

By E(reg)T0L, E(rc)T0L and E(for)T0L we denote the families of languages
generated by ET0L systems with regular, random and forbidden context
conditions, respectively.
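
The three kinds of conditions can be pictured as predicates on words. The sketch below is an illustration only: words are Python strings of one-character letters, "any letter of Q occurs" is read as "every letter of Q occurs" (an assumption about the intended meaning), and the regular language is given by a Python regular expression that is assumed to use only truly regular constructs. All function names are hypothetical.

```python
import re


def regular_condition(pattern):
    """rho = R: true iff the whole word matches the regular expression `pattern`."""
    compiled = re.compile(pattern)
    return lambda w: compiled.fullmatch(w) is not None


def random_context_condition(required, forbidden):
    """rho = (Q, R): every letter of Q occurs in w and no letter of R occurs in w."""
    required, forbidden = set(required), set(forbidden)
    return lambda w: required <= set(w) and not (forbidden & set(w))


def forbidden_context_condition(forbidden):
    """Random context condition whose set Q of required letters is empty."""
    return random_context_condition((), forbidden)
```

For example, `regular_condition("a*")` plays the role of the filter {a}* and `forbidden_context_condition("d")` accepts exactly the words that do not contain the letter d.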

An NLP_F0L system (of degree n, $n \ge 1$) is a construct

$\Gamma = (V, (P_1, F_1, \rho_1, \sigma_1), (P_2, F_2, \rho_2, \sigma_2), \ldots, (P_n, F_n, \rho_n, \sigma_n))$

where

- V is an alphabet,

- for $1 \le i \le n$, $P_i$ is a finite subset of $V \times V^*$ such that
$V = \{a \mid (a, v) \in P_i\}$,

- for $1 \le i \le n$, $F_i$ is a finite subset of $V^*$,

- for $1 \le i \le n$, $\rho_i$ and $\sigma_i$ are mappings from $V^*$ to {true, false}.

The quadruples $(P_i, F_i, \rho_i, \sigma_i)$, $1 \le i \le n$, are called the
components or nodes of the system, and $P_i$, $F_i$, $\rho_i$ and $\sigma_i$ are
called their set of productions, set of axioms, exit filter and entrance filter,
respectively. Note that any $P_i$, $1 \le i \le n$, is a usual set of productions
of a 0L system.

A configuration C of an NLP_F0L system $\Gamma$ as above is an n-tuple of
languages over V, i.e. $C = (L_1, L_2, \ldots, L_n)$ where $L_i \subseteq V^*$
for $1 \le i \le n$.
Let $C = (L_1, L_2, \ldots, L_n)$ and $C' = (L'_1, L'_2, \ldots, L'_n)$ be two
configurations of an NLP_F0L system $\Gamma$ as above. Then we say that

- C' is obtained from C by a derivation step, written as $C \Rightarrow C'$,
if for $1 \le i \le n$

$L'_i = \{w' : w \Rightarrow_{P_i} w' \text{ for some } w \in L_i\}$,

- C' is obtained from C by a communication step, written as $C \vdash C'$,
if for $1 \le i \le n$

$L'_i = L_i \cup \bigcup_{j=1}^{n} \{w \mid w \in L_j,\ \rho_j(w) = \sigma_i(w) = true\}$.

With any NLP_F0L system $\Gamma$ as above we associate a sequence of
configurations

$C_m = (L_{m,1}, L_{m,2}, \ldots, L_{m,n})$, $m \ge 0$,

such that

$C_0 = (F_1, F_2, \ldots, F_n)$ and $C_{2k} \Rightarrow C_{2k+1}$ and
$C_{2k+1} \vdash C_{2k+2}$ for $k \ge 0$,
and define the language $L(\Gamma)$ generated by $\Gamma$ by

$L(\Gamma) = \bigcup_{k \ge 0} L_{2k+1,1}$.
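
The alternation of derivation and communication steps can be simulated for a bounded number of steps. The following Python sketch assumes one-character letters, words represented as strings, and filters given as Python predicates; the dictionary keys (`P`, `F`, `exit`, `entrance`) and the two-node system `toy` are hypothetical illustrations, not taken from [2].

```python
from itertools import product


def derive_word(word, productions):
    """All words obtained from `word` by one parallel (0L) rewriting step."""
    choices = [productions[letter] for letter in word]
    return {"".join(parts) for parts in product(*choices)}


def derivation_step(config, system):
    """C => C': every node rewrites all of its words with its own productions."""
    return [
        set().union(*(derive_word(w, node["P"]) for w in language))
        for language, node in zip(config, system)
    ]


def communication_step(config, system):
    """C |- C': node i keeps L_i and adds words passing exit_j and entrance_i."""
    new_config = []
    for receiver, own in zip(system, config):
        received = {
            w
            for sender, language in zip(system, config)
            for w in language
            if sender["exit"](w) and receiver["entrance"](w)
        }
        new_config.append(own | received)
    return new_config


def configurations(system, steps):
    """Return [C_0, ..., C_steps]; even-indexed C_m are followed by a derivation."""
    config = [set(node["F"]) for node in system]
    history = [config]
    for m in range(steps):
        step = derivation_step if m % 2 == 0 else communication_step
        config = step(config, system)
        history.append(config)
    return history


def generated_words(system, steps):
    """Union of L_{2k+1,1} over all odd indices 2k+1 <= steps."""
    history = configurations(system, steps)
    return set().union(*(history[m][0] for m in range(1, steps + 1, 2)))


# Hypothetical system: node 2 doubles its word and exports everything,
# node 1 only collects what it receives.
toy = [
    {"P": {"a": ["a"]}, "F": [], "exit": lambda w: False, "entrance": lambda w: True},
    {"P": {"a": ["aa"]}, "F": ["a"], "exit": lambda w: True, "entrance": lambda w: False},
]
```

Only a finite prefix of the (in general infinite) language is enumerated, so the sketch is suited to experiments with small systems rather than to decision procedures.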

We call an NLP_F0L system an NLP_0L system if, for $1 \le i \le n$, any set $F_i$
contains exactly one element. Moreover, we call an NLP_F0L system propagating
if, for $1 \le i \le n$, $P_i$ is a subset of $V \times V^+$. Furthermore, an
NLP_F0L system is called deterministic if, for $1 \le i \le n$,
$a \to \alpha \in P_i$ and $a \to \beta \in P_i$ imply $\alpha = \beta$. If the
system under consideration is propagating or deterministic we add the letter P
or D, respectively. Thus we get NLP_PF0L, NLP_DF0L, NLP_DP0L etc. systems.

By NLP_n_F0L and NLP_F0L we denote the families of all languages generated by
NLP_F0L systems with n components and of all languages generated by NLP_F0L
systems (without a restriction concerning the degree). NLP_n_0L, NLP_n_PF0L,
NLP_DF0L etc. are defined analogously.

Again, we are mostly interested in filters which are regular or random or
forbidden context conditions. The corresponding language families are denoted
by (reg)NLP_n_F0L, (rc)NLP_n_F0L and (for)NLP_n_F0L, and the notation is used
accordingly in the case of deterministic, propagating and/or single axiom
systems.

[ Example 1.] The NLP_0L system

$\Gamma = (\{a, A, F\}, (P_1, \{a\}, \rho_1, \sigma_1), (P_2, \{A\}, \rho_2, \sigma_2))$

with 

Pi = {a ^ a^}, P 2 = {A ^ , A ^ a, a ^ F} 

and the filters (regular context conditions) 

$\rho_1 = \emptyset$, $\sigma_1 = \{a\}^*$, $\rho_2 = \{a\}^*$, $\sigma_2 = \emptyset$

generates the language 

L[F) = {a^ I n > l,m > 0}. 
2 The Number of Components

We start with a simple observation. 

[ Lemma 1.] For $n \ge 1$, $Y \in \{F0L, 0L, PF0L, P0L\}$ and
$x \in \{(reg), (rc), (for)\}$,

xNLP_n_Y $\subseteq$ xNLP_{n+1}_Y.

[ Proof. ] Let L in (reg)NLP_n_Y be generated by an NLP_Y system $\Gamma$ with n
components, with the alphabet V and regular context conditions. Then we define
the NLP_Y system $\Gamma'$ by adding the component
$(\{x \to x \mid x \in V\}, \emptyset, \emptyset, \emptyset)$, which does not
obtain words from other components and does not send words to other components
by the definition of its exit and entrance filter. Thus $L(\Gamma) = L(\Gamma')$,
which proves (reg)NLP_n_Y $\subseteq$ (reg)NLP_{n+1}_Y.

The modifications for random and forbidden context conditions are left to
the reader.

We now investigate the hierarchy given in Lemma 1 in more detail. 

[ Theorem 1.] For $x \in \{(reg), (rc), (for)\}$ and
$Y \in \{F0L, 0L, PF0L, P0L\}$,

Y = xNLP_1_Y $\subset$ xNLP_2_Y.

[ Proof. ] The equality Y = NLP_1_Y holds by definition.

By Example 1, the language L generated there belongs to (reg)NLP_2_Y. On the
other hand, $L \notin F0L$ can be shown by standard methods. This proves that the
inclusion (reg)NLP_1_Y $\subseteq$ (reg)NLP_2_Y from Lemma 1 is proper in the
case of regular context conditions.

For the case of random and forbidden context conditions as filters, we modify
the filters in Example 1 such that they are random or forbidden context
conditions and have the same effect as the regular context conditions.



[ Theorem 2.] For $n \ge 3$, $Y \in \{0L, F0L\}$ and $x \in \{(reg), (for)\}$,

ExT0L = xNLP_n_Y and ExPT0L = xNLP_n_PY.

[ Proof. ] We only give the proof for Y = F0L and x = (reg). The necessary
modifications for the other cases are left to the reader (and refer to [3]).

By [2], Theorem 4.3, and Lemma 1,

(reg)NLP_3_F0L $\subseteq$ (reg)NLP_4_F0L $\subseteq \ldots \subseteq$ (reg)NLP_F0L $\subseteq$ E(reg)T0L.

Hence it is sufficient to show that E(reg)T0L $\subseteq$ (reg)NLP_3_F0L.

Let L be in E(reg)T0L and let L = L(G) for some ET0L system

$G = (V, V', R_1 : P_1, R_2 : P_2, \ldots, R_n : P_n, w)$

with n tables and regular languages $R_i$, $1 \le i \le n$, as conditions. For
$1 \le i \le n$ we set

$V^{(i)} = \{x^{(i)} : x \in V\}$.
Moreover, for a word $y = x_1 x_2 \ldots x_m$ with $x_j \in V$ for
$1 \le j \le m$, a set $R \subseteq V^*$ and $1 \le i \le n$, we set
$y^{(i)} = x_1^{(i)} x_2^{(i)} \ldots x_m^{(i)}$ and
$R^{(i)} = \{y^{(i)} \mid y \in R\}$.

We now consider the NLP system

$\Gamma = (W, (Q_1, F_1, \rho_1, \sigma_1), (Q_2, F_2, \rho_2, \sigma_2), (Q_3, F_3, \rho_3, \sigma_3))$

where

$W = V \cup \bigcup_{i=1}^{n} V^{(i)} \cup \{S\}$ (S is an additional letter),

$Q_1 = \{z \to z \mid z \in W\}$,

Ci = 0, ^>l = W^^ 03 = {V')*, 

Q 2 = {x ^ I X € \/} U ^ x0^^) |x€ V, 1 < i < n — 1} 

U ^x:x€ \/}U{S'^ w}, 

C2 = {S}, 92 = W*, 02= V*, 

Q3 = {x^x|xg V} U {x0^ vJx '■ X vJx & Fi, 1 < * < n} U {S ^ S}, 

n 

F 3 = 0, 93 = i>, CT3 = y i^0. 

i=l 

Now assume that there is a derivation

$w = w_0 \Rightarrow_{P_{i_1}} w_1 \Rightarrow_{P_{i_2}} w_2 \Rightarrow_{P_{i_3}} \cdots \Rightarrow_{P_{i_m}} w_m$ (1)

in G with $m \ge 1$. By induction on m it is easy to show that

Wm ^ +2z2...2zm.+2m+l,3* (2) 

Now let z G L{G). Then z = w E {V'Y or there is a derivation (1) with m > 1 
and z = Wm £ (W)*. In the former case w G Ti,2j ^ ^ ^2,1 by a communi- 
cation step and w G ^3,1 by a derivation step, and in the latter case Wm € 
^2 zi+ 2 z 2 ...2zm,y2my2,i by a communication step and Wm ^ -b222y2z2...2zm,y2my3,i 
by a derivation step. Thus $z \in L(\Gamma)$ and hence $L(G) \subseteq L(\Gamma)$.

Conversely, by induction on the number of steps it is easy to prove that, for
$m \ge 1$,

- if $w \in L_{m,2} \cup L_{m,3}$, then $w \in S(G)$ or $w = z^{(i)}$ for some
$z \in S(G)$ and $1 \le i \le n$,

- $L_{m,1} \subseteq L(G)$.

Therefore $L(\Gamma) \subseteq L(G)$.

Thus $L(G) = L(\Gamma)$ and E(reg)T0L $\subseteq$ (reg)NLP_3_F0L.

Combining these results, for $x \in \{(reg), (for)\}$ and $Y \in \{F0L, 0L\}$, we
obtain the hierarchy

Y = xNLP_1_Y $\subset$ xNLP_2_Y $\subseteq$ xNLP_3_Y = xNLP_4_Y = ... = xNLP_Y = ExT0L.

The only problem remaining open is the question whether or not two components
are as powerful as three (or more) components.




J. Dassow 



3 Growth Functions of NLP Systems

[ Definition 1.] Let $x \in \{(reg), (rc), (for)\}$. For an xNLP_F0L system
$\Gamma$ with n components and $1 \le i \le n$ we define

(i) the growth function $f_{i,\Gamma} : N \to N$ of $\Gamma$ at node i by
$f_{i,\Gamma}(m) = |L_{m,i}|$

and

(ii) the growth function $f_\Gamma : N \to N$ of $\Gamma$ by
$f_\Gamma(m) = \sum_{i=1}^{n} f_{i,\Gamma}(m)$.

In [2] growth functions of NLP_F0L systems have been introduced. However, there
$L_{m,i}$ is considered as a multiset whereas we consider it as a set.

[ Theorem 3.] Let $x \in \{(reg), (rc), (for)\}$. Then it is undecidable whether
or not the growth functions or the growth functions at some node coincide for two
given (x)NLP_P0L systems with at least three components.

[ Proof. ] We only give the proof for x = (reg). The obvious modifications for
$x \in \{(rc), (for)\}$ are left to the reader.

For an arbitrary number $n \ge 1$, we consider the (reg)NLP_P0L system

$\Gamma = (V, (P_1, \{c\}, V^*, V^* \setminus V^*\{d\}V^*), (P_2, \{d\}, V^*, V^* \setminus V^*\{c\}V^*), (P_3, \{c\}, \emptyset, V^*))$

where

$V = \{a, b, c, d, [1], [2], \ldots, [n]\}$,

$P_1 = \{c \to [i]ca \mid 1 \le i \le n\} \cup \{x \to x \mid x \in V, x \ne c\}$,

$P_2 = \{d \to [i]db \mid 1 \le i \le n\} \cup \{x \to x \mid x \in V, x \ne d\}$,

$P_3 = \{c \to d\} \cup \{x \to x \mid x \in V \setminus \{c\}\}$.

Then we obtain

$f_{1,\Gamma}(2m) = f_{1,\Gamma}(2m - 1) = f_{2,\Gamma}(2m) = f_{2,\Gamma}(2m - 1) = n^m$ for $m \ge 1$,

$f_{3,\Gamma}(0) = f_{3,\Gamma}(1) = 1$,

$f_{3,\Gamma}(2m) = f_{3,\Gamma}(2m + 1) = 1 + 2 \cdot \sum_{k=1}^{m} n^k$ for $m \ge 1$.
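
The node-wise growth of the three-component system defined above can be checked by brute force for small n. The sketch below is an illustration under stated assumptions: words are tuples of symbols (so that `[1]`, `[2]`, ... count as single letters), and the helper names `build_gamma`, `derive` and `growth` are hypothetical. For n = 2 it yields the sizes 1, 1, 5, 5, 13, 13 at node 3, that is, $1 + 2(n + n^2 + \ldots + n^m)$ at steps 2m and 2m + 1.

```python
from itertools import product


def build_gamma(n):
    """Nodes (productions, axioms, exit filter, entrance filter) of the system."""
    indices = [f"[{i}]" for i in range(1, n + 1)]
    alphabet = ["a", "b", "c", "d"] + indices

    def identity_except(letter):
        return {x: [(x,)] for x in alphabet if x != letter}

    p1 = identity_except("c")
    p1["c"] = [(i, "c", "a") for i in indices]   # c -> [i]ca
    p2 = identity_except("d")
    p2["d"] = [(i, "d", "b") for i in indices]   # d -> [i]db
    p3 = identity_except("c")
    p3["c"] = [("d",)]                           # c -> d

    def everything(word):
        return True

    def nothing(word):
        return False

    def without(letter):
        return lambda word: letter not in word

    return [
        (p1, {("c",)}, everything, without("d")),
        (p2, {("d",)}, everything, without("c")),
        (p3, {("c",)}, nothing, everything),
    ]


def derive(word, productions):
    """All results of rewriting every symbol of `word` in parallel."""
    options = product(*(productions[symbol] for symbol in word))
    return {sum(parts, ()) for parts in options}


def growth(system, steps):
    """Rows [|L_{m,1}|, |L_{m,2}|, ...] for m = 0, ..., steps."""
    config = [set(axioms) for _, axioms, _, _ in system]
    table = [[len(language) for language in config]]
    for m in range(steps):
        if m % 2 == 0:   # derivation step
            config = [
                set().union(*(derive(w, node[0]) for w in language))
                for node, language in zip(system, config)
            ]
        else:            # communication step
            sent = [{w for w in language if node[2](w)}
                    for node, language in zip(system, config)]
            config = [
                language | {w for batch in sent for w in batch if node[3](w)}
                for node, language in zip(system, config)
            ]
        table.append([len(language) for language in config])
    return table
```

Calling `growth(build_gamma(2), 5)` lists the three node sizes for the configurations $C_0, \ldots, C_5$; the first two columns double after every derivation step, while the third column only grows at communication steps.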

With an instance $I = \{(u_1, v_1), (u_2, v_2), \ldots, (u_n, v_n)\}$ of the Post
Correspondence Problem over some alphabet U we associate the (reg)NLP_0L system

$\Delta_I = (W, (Q_1, \{c\}, W^*, W^* \setminus W^*\{d\}W^*), (Q_2, \{d\}, W^*, W^* \setminus W^*\{c\}W^*), (Q_3, \{c\}, \emptyset, W^*))$

where

$W = U \cup \{c, d, [1], [2], \ldots, [n]\}$, $U \cap \{c, d, [1], [2], \ldots, [n]\} = \emptyset$,

$Q_1 = \{c \to [i]cu_i \mid 1 \le i \le n\} \cup \{x \to x \mid x \in W, x \ne c\}$,

$Q_2 = \{d \to [i]dv_i \mid 1 \le i \le n\} \cup \{x \to x \mid x \in W, x \ne d\}$,

$Q_3 = \{c \to d\} \cup \{x \to x \mid x \in W \setminus \{c\}\}$.

Some Remarks on Networks of Parallel Language Processors 






Obviously,

$f_{1,\Delta_I}(m) = f_{1,\Gamma}(m)$ and $f_{2,\Delta_I}(m) = f_{2,\Gamma}(m)$ for $m \ge 0$.

Moreover, if the instance I has no solution of the Post Correspondence Problem,
then

$f_{3,\Delta_I}(m) = f_{3,\Gamma}(m)$ for $m \ge 0$.

However, if I has a minimal solution $i_1 i_2 \ldots i_r$ (with respect to the
length of the index sequence), then

$f_{3,\Delta_I}(m) < f_{3,\Gamma}(m)$ for $m > r$.

Hence $f_{3,\Delta_I} = f_{3,\Gamma}$ and $f_{\Delta_I} = f_\Gamma$ hold if and
only if I has no solution.

The NLP_0L systems constructed in the proof of Theorem 3 are nondeterministic.
The following statement is the deterministic version (which we present without
proof; a proof analogous to the above one is given in [3]).

[ Theorem 4.] Let $x \in \{(reg), (rc), (for)\}$. Then it is undecidable whether
or not the growth functions or the growth functions at some node coincide for two
given (x)NLP_PD0L systems with the same number of components.

A D0L system G = (V, P, w) generates a unique sequence

$w = w_0 \Rightarrow w_1 \Rightarrow w_2 \Rightarrow \cdots \Rightarrow w_m \Rightarrow \cdots$

of words. The growth function $g_G : N \to N$ of G is defined by
$g_G(m) = l(w_m)$. g is called a D0L growth function if there is a D0L system G
such that $g = g_G$.
We mention that the problem whether or not the growth functions of two
NLP_DF0L systems coincide is decidable if we consider growth functions on the
basis of multisets. This follows from the facts that such growth functions are
D0L growth functions (see [2], Theorem 5.1) and that the growth equivalence of
D0L systems is decidable (see [6], Theorem 3.3).
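
The growth function of a D0L system is obtained by iterating its unique derivation and recording word lengths. A minimal Python sketch, assuming one-character letters and a dictionary mapping each letter to its single right-hand side (the names `d0l_sequence` and `d0l_growth` are illustrative):

```python
def d0l_sequence(productions, axiom, steps):
    """Return [w_0, w_1, ..., w_steps] for the D0L system (V, productions, axiom)."""
    words = [axiom]
    for _ in range(steps):
        # Deterministic parallel rewriting: every letter has exactly one rule.
        words.append("".join(productions[letter] for letter in words[-1]))
    return words


def d0l_growth(productions, axiom, steps):
    """Values g(0), ..., g(steps) of the growth function, g(m) = length of w_m."""
    return [len(word) for word in d0l_sequence(productions, axiom, steps)]
```

For instance, with the rules a -> b and b -> ab and the axiom a, the word lengths follow the Fibonacci numbers.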

We now present a relation between D0L growth functions and growth functions
of NLP_F0L systems (with one component).

[ Theorem 5.] For any D0L growth function g, there is an (NLP_)F0L system
$\Gamma$ such that

$f_\Gamma(0) = f_{1,\Gamma}(0) = g(0)$,

$f_\Gamma(2i - 1) = f_{1,\Gamma}(2i - 1) = g(i)$ for $i \ge 1$.

[ Proof.] Let G = (V, P, w) be a D0L system with

$V = \{a_1, a_2, \ldots, a_n\}$,

$p_i = a_i \to x_{i1} x_{i2} \ldots x_{i r_i}$ for $1 \le i \le n$,

$w = y_1 y_2 \ldots y_m$ with $y_j \in V$ for $1 \le j \le m$,

$g_G = g$.
First let us assume that $y_i \ne y_j$ for $1 \le i < j \le m$. We construct the
(NLP_)F0L system (with one component) $\Gamma = (V', (P', F, \rho, \sigma))$ with

$V' = V \cup \{(a_i, j) \mid 1 \le i \le n, 1 \le j \le r_i\}$,

$P' = \bigcup_{i=1}^{n} \bigcup_{j=1}^{r_i} \{a_i \to x_{ij}(a_i, j),\ (a_i, j) \to (a_i, j)\}$,

$F = \{y_1, y_2, \ldots, y_m\}$

and arbitrary filters. Since we have only one component, a communication
step does not change the language, i.e. $L_{2i-1,1} = L_{2i,1}$ for $i \ge 1$.
It is easy to prove by induction that any word $z \in L_{2i-1,1}$ has the form

$z = b_i (b_{i-1}, j_{i-1})(b_{i-2}, j_{i-2}) \ldots (b_0, j_0)$

where, for $1 \le k \le i$, $b_k$ is the $j_{k-1}$-th letter of the right hand
side of the rule in P with the left hand side $b_{k-1}$. Therefore
$z \in L_{2i-1,1}$ starts with a letter $b_i$ of V and the remaining i letters
store the "derivation" of $b_i$ in the D0L system G. In the sequel let
$b'_i = (b_{i-1}, j_{i-1})(b_{i-2}, j_{i-2}) \ldots (b_0, j_0)$.

If the D0L system G generates the word $u_1 u_2 \ldots u_t$ in i steps, then

$L_{2i-1,1} = \{u_1 u'_1, u_2 u'_2, \ldots, u_t u'_t\}$.

Moreover, all words of $L_{2i-1,1}$ are pairwise different. Hence

$f_\Gamma(2i - 1) = |L_{2i-1,1}| = t = l(u_1 u_2 \ldots u_t) = g(i)$.

By slight modifications (introduce primed versions of letters) we can prove the
statement for the case that some letters occur a number of times in the axiom
of the D0L system.
[ Abstract.] In this paper we present a new computational
model based on DNA molecules and genetic operations. This
model incorporates the theoretical simulation of the main
operations of genetic algorithms: selecting individuals from the
population to create a new generation, crossing selected
individuals, mutating crossed individuals, evaluating the fitness
of generated individuals, and introducing individuals into the
population. This is a first step that will permit the resolution
of larger instances of search problems, far beyond the scope of
exact and exponentially sized DNA algorithms like those proposed
by Adleman [1] and Lipton [2].



1 Introduction 

In a short period of time great advances have been made in DNA-based
computation. With a large number of DNA strands and some biological operations
it is possible to obtain a universal model able to solve any given decidable
problem. Adleman [1] began this field by describing an abstract model that he
applied to the solution of an NP-complete problem, the Directed Hamiltonian
Path. Then, Lipton [2] showed how to solve more general problems by
finding a solution to the SAT problem.

In this paper, we show the possibility of solving optimization problems without
generating or exploring the complete search space. In the second section of
the paper the steps of genetic algorithms as well as their main operations are
explained. In the third, we describe the principles of molecular computation and
the main molecular operations. In the fourth section, we detail the simulation of
genetic algorithms employing DNA molecules.



2 Genetic algorithms 

Genetic algorithms are adaptive search techniques inspired by Darwin's theory
of evolution. According to Darwin's theory, individuals who are better
adapted to the environment survive and transfer some kind of information to
their descendants through their genetic code. Genetic algorithms follow the same
principle: potential solutions to a problem are coded by generating an initial
random population that develops through recombination and mutation
operations. The following generations are formed by evolution, so that in time
the population comes to consist of better individuals (solutions). In this way,
the best solutions to the problem survive and transfer their internal information
to their descendants.

The structure of a basic genetic algorithm includes the following steps: (1)
generate an initial random population, evaluating the fitness of each individual;
(2) extract individuals; (3) cross and mutate the extracted individuals; (4)
evaluate and introduce the newly created individuals. Steps 2 to 4 are repeated
until the population has converged.
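
The four steps can be sketched in conventional software as follows. This is a minimal illustration, not the molecular procedure of this paper: it assumes bit-string individuals, binary tournament extraction, one-point crossover, replacement of the worst individual, and a fixed number of iterations in place of a convergence test. All names and parameter values are hypothetical.

```python
import random


def genetic_algorithm(fitness, n_bits, pop_size, generations, rng, mutation_rate=0.05):
    """Return the fittest individual found; requires n_bits >= 2 and pop_size >= 2."""
    # (1) Generate an initial random population.
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # (2) Extract two parents, each the better of two random individuals.
        parents = [max(rng.sample(population, 2), key=fitness) for _ in range(2)]
        # (3) Cross (one-point crossover) and mutate (independent bit flips).
        cut = rng.randrange(1, n_bits)
        child = parents[0][:cut] + parents[1][cut:]
        child = [bit ^ (rng.random() < mutation_rate) for bit in child]
        # (4) Evaluate the child and introduce it in place of the worst individual.
        worst = min(range(pop_size), key=lambda k: fitness(population[k]))
        if fitness(child) >= fitness(population[worst]):
            population[worst] = child
    return max(population, key=fitness)
```

A call such as `genetic_algorithm(sum, 8, 6, 200, random.Random(0))` maximizes the number of ones in an 8-bit string; because a child only replaces the worst individual when it is at least as fit, the best fitness in the population never decreases.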

3 Molecular Computation 

Molecular computation, or DNA computation, encodes solutions of complex
problems in DNA strands and applies conventional techniques from molecular
biology to obtain the right solution by filtering out the wrong candidates.

Biological computation presents a set of advantages and disadvantages with 
regard to conventional computation. Some advantages of molecular computation 
are massive parallelism (10^*^ molecules of DNA in a tube), high storage density 
and low energy consumption. 

Among the disadvantages of molecular computers are the limited number of
biological operations available as well as the complexity and experimental
errors of these biological operations.

3.1 Main Molecular Operations 

We present the main biological operations that we employ to manipulate DNA 
strands. 

- Strand separation according to length using gel electrophoresis [8].

- Strand separation according to a given subchain s using complementary
probes anchored to magnetic beads.

- Denaturation of DNA strands.

- Separation of strands with a certain symmetry forming palindromes: An initial
tube T containing DNA chains is separated into two tubes $T_1$ and $T_2$.
Tube $T_1$ will contain the chains that have a palindrome symmetry and tube
$T_2$ the rest of the chains. A strand has palindrome symmetry when it
contains complementary substrands at both sides of a central point. If we
denature the double strands, the resulting single strands will take the
geometry shown in figure 1, due to the complementary nucleotides in the
symmetry zone. To separate the chains forming palindromes from the ones that
do not, we will use gel electrophoresis [8].

- Appending a sequence of nucleotides to a free end of a strand [3].

- Site directed mutagenesis [4].

- Cutting strands: Restriction enzymes cut the DNA strands in a
sequence-specific form. The cut produced by the restriction enzymes may leave
sticky ends or blunt ends in the DNA fragments. DNA fragments with
complementary sticky ends may reanneal, forming a new double strand.

- Chain duplication through Polymerase Chain Reaction (PCR): With this
operation, two identical tubes $T_1$ and $T_2$ are obtained from a first tube T.
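
The palindrome separation in the list above can be mimicked in software by testing whether the two ends of a strand are reverse complements of each other. The sketch below is a simplified model under stated assumptions: strands are strings over A, C, G, T, the arm length is supplied by the caller, and the names `reverse_complement`, `has_palindrome_symmetry` and `separate` are hypothetical. It says nothing about the laboratory protocol itself.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand):
    """Sequence that anneals to `strand`, read in the same 5'-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def has_palindrome_symmetry(strand, arm):
    """True if the first `arm` bases can pair with the last `arm` bases."""
    if arm < 1 or len(strand) < 2 * arm:
        return False
    return strand[:arm] == reverse_complement(strand[-arm:])


def separate(tube, arm):
    """Split tube T into (T1, T2): strands with and without palindrome symmetry."""
    t1 = [s for s in tube if has_palindrome_symmetry(s, arm)]
    t2 = [s for s in tube if not has_palindrome_symmetry(s, arm)]
    return t1, t2
```

For example, `separate(["GAATTC", "GAATTT"], 3)` puts the first strand into the first tube and the second strand into the second one.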
Fig. 1. Palindrome loop.



4 Genetic Algorithm Simulation with Biological 
Computation. 

The simulation of genetic algorithms with DNA molecules requires a set of test
tubes, each with a well-defined purpose: (1) population tubes, (2) mating
pool tubes and (3) temporary tubes. Individuals from different tubes have
different formats.



4.1 Individual Encoding 

The choice of an encoding is a key point for the correct evolution of the
population towards the final solution. Individuals need to be coded so that
combination, duplication, copying, quick fitness evaluation and selection
of a specific individual inside the population or mating pool are possible
without the need for complete sequencing.

- The Lipton encoding [2] is used to obtain each individual coded by a sequence
of ones and zeros, where the ENC(b, i) function returns the encoding of
b (0 or 1) at the i-th bit place. The encoding returned by the ENC
function is unique for each value of b and i. From now on, ENC(b, i) will
be represented by $b_i$.

- Between the DNA code belonging to bit places i and i + 1, a cutting or
cleavage site for a restriction enzyme will be inserted.

- To allow a quick fitness evaluation, a field with the adaptation grade of the
individual is included in the DNA strand. The fitness may be coded so that
its length is proportional to the value it represents.

- A field indicating the numbering of that individual in the population will
be included to locate it. It will be inserted at both sides of the DNA
strand, with the peculiarity that on one side (sequence $N'_p$) it is
symmetrically complementary, with a palindrome structure, to the other side
(sequence $N_p$), which allows a later separation operation.
One more cleavage site $RE_p$ for a restriction enzyme will also be included
to separate its numbering.
Fig. 2. Final individual encoding.



Taking everything into account, the final encoding of an individual has the
format shown in figure 2.

It must be considered that the individuals of the population may go to the
mating pool, from where they will be selected for later genetic operations, so
some identification is needed in the usual encoding to select a specific
individual inside the mating pool. That identifier will be the sequence $N_m$.
This value will be introduced in the fitness value field, so that mating pool
individuals have the same format as population ones, with $N_m$ values instead
of the fitness field.

4.2 Generation of the initial population 

For a generic population with m individuals, in which each individual represents
its genotype with n bits, n - 1 steps are needed to generate the whole initial
population. Also, in each step we would need to sequence 2m molecules. The steps
of the initial population generation algorithm are the following.

Half of the population is synthesized with a value 0 in the first bit place and
with N having values for numbering half of the population. These sequences
have the format $RE_0$-$0_0$-$RE_p$-$N_p$-PCR Primer. The other half is
synthesized with a 1 in the first bit place and with N having values for
numbering the rest of the population. These sequences have the format
$RE_0$-$1_0$-$RE_p$-$N_p$-PCR Primer.

The sequences $RE_1$-$1_1$-$RE_0$ and $RE_1$-$0_1$-$RE_0$ are also created. The
previous chains are put together in a temporary test tube. The restriction
enzyme corresponding to the site $RE_0$ is applied. The strands are cut and DNA
ligase is applied to rejoin the fragments. Supposing that there are no mistakes
in the operations of joining m molecules, the following sequences are obtained:
$RE_1$-$b_1$-$RE_0$-$b_0$-$RE_p$-$N_p$-PCR Primer, with different values for b
and N.

Until step n - 1 is reached, in a generic step i half of the individuals will
be created with the encoding format $RE_i$-$0_i$-$RE_{i-1}$ and the other half
with value 1: $RE_i$-$1_i$-$RE_{i-1}$. They will be joined in a temporary test
tube with the strands obtained in step i - 1. The restriction enzyme
corresponding to the site $RE_{i-1}$ and DNA ligase will be applied to obtain
the resulting strands for step i. Strands will have the following format:
$RE_i$-$b_i$-$RE_{i-1}$-...-$RE_0$-$b_0$-$RE_p$-$N_p$-PCR Primer.

When i = n - 1, $RE_f$ should be used instead of $RE_i$.

Once we get the individuals in the last format, it is necessary to evaluate
the fitness function to append its value [3] and join the number $N_p$. This
number is encoded with a symmetric sequence of $N_p$. We need $\log_2 m$ steps of
extraction and appending to write the $\log_2 m$ bits of $N_p$.
4.3 Individual selection for mating pool

Length separation is used to select individuals to fill the mating pool, assuming
that better adapted individuals are longer than worse adapted individuals. There
are different methods for creating the mating pool.



Method of the best ones The purpose is to take the best n individuals of
the population. Length separation will be used so that the n best individuals are
selected. To view the individuals with their respective lengths, we add a
radioactive marker corresponding to each individual number, so that by applying
X-rays to the electrophoresis gel all individuals sorted by fitness will be seen
and the n longest will be taken.



Roulette method. This method will be carried out using a scale function.
This function gives the contribution of each strand's fitness (its length) to
the global fitness. From the scale function we obtain the inverse function,
which returns the individual identification from the fitness contribution to
the global fitness in that particular population. Then, numbers will be
generated randomly between one and one hundred, and using the inverse function
we get the individual identifications for these numbers. These individuals will
be taken out of the population by separating, with magnetic beads, the chains
that contain the sequence corresponding to the individual number.
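The roulette method can be illustrated in software. The following is only a
minimal sketch, assuming that strand lengths are given as a list of numbers and
that the random numbers are drawn uniformly from the interval 0 to 100; the
function name roulette_select and its parameters are hypothetical and do not
come from the paper.

```python
import bisect
import random
from itertools import accumulate

def roulette_select(lengths, k, rng=random):
    """Pick k individual indices with probability proportional to strand length."""
    total = sum(lengths)
    # Scale function: cumulative share of the global fitness on a 0-100 scale.
    cumulative = [100.0 * c / total for c in accumulate(lengths)]
    chosen = []
    for _ in range(k):
        r = rng.uniform(0.0, 100.0)
        # Inverse function: first individual whose cumulative share reaches r.
        idx = bisect.bisect_left(cumulative, r)
        # Guard against floating-point round-off at the upper end of the scale.
        chosen.append(min(idx, len(lengths) - 1))
    return chosen
```

Longer strands occupy a wider slice of the 0 to 100 scale, so they are returned
more often by the inverse lookup.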



Crowding method. This method consists of taking individuals randomly from
the population; those individuals are introduced in a temporary test tube
and the best of them is taken and introduced in the mating pool. This
step will be repeated until the mating pool is filled.

To take individuals randomly, a random number must be generated and the
individual obtained from the population as described previously. Alternatively,
instead of generating a random number, a small quantity of chains could be
taken from the population tube and the best element chosen by separating it
by length.
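A software analogue of the crowding method is sketched below. It assumes that
the "small quantity of chains" is a random sample of fixed size and that the
best element is the longest strand in the sample; crowding_select and
sample_size are hypothetical names introduced only for illustration.

```python
import random

def crowding_select(lengths, pool_size, sample_size=3, rng=random):
    """Fill a mating pool by keeping the longest strand of each random sample."""
    pool = []
    while len(pool) < pool_size:
        # Temporary test tube: a random sample of distinct individuals.
        sample = rng.sample(range(len(lengths)), sample_size)
        # Length separation: keep the index of the longest strand.
        pool.append(max(sample, key=lambda i: lengths[i]))
    return pool
```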

Once the individuals selected using one of the methods explained above are
introduced in the mating pool test tube, it is necessary to change the chain
format to introduce individual numbers for the mating pool. This mating pool
numbering will replace the fitness field of the population format, and it will
be introduced by applying site-directed mutagenesis [4] to the individuals in
the mating pool tube.

4.4 Individual Crossing 

For the crossing operation the mating pool individuals will be taken in pairs,
or the crowding method will be used to select an individual from the population
that will be crossed with one from the mating pool. To get a pair of individuals
from the mating pool, we use strand separation according to the subchains of
the two mating pool numbers selected to be crossed. If the crowding method is
used, two individuals with different formats will be crossed, because the chain
format differs between population and mating pool individuals. To solve this,
it is enough to place in the individual from the population a mating pool
individual number 0, never used by a mating pool individual.

If a one-point crossing is to be made, a crossing point will be chosen
randomly. This is equivalent to choosing a restriction enzyme randomly among
RE_0 ... RE_{n-1}. Then the chosen restriction enzyme is applied and the chains
are cut in two parts. It is possible that the original individuals will be
generated again; to avoid this, chains having a palindrome structure will be
separated from the rest of the resultant chains, since chains with a palindrome
structure are those for which the crossing has not been made, and these chains
have to be removed. A multi-point crossing can be made in several one-point
crossing steps.
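As an illustration, one-point crossing on bit strings can be sketched as
follows. This is a simplification under stated assumptions: individuals are
plain Python strings of equal length, the random cut position plays the role of
the randomly chosen restriction site, and offspring identical to the parents
are discarded by direct comparison rather than by the palindrome test described
above. The name one_point_crossover is hypothetical.

```python
import random

def one_point_crossover(parent_a, parent_b, rng=random):
    """Cut both bit strings at one random site and swap the tails.

    Returns None when the offspring equal the parents, mirroring the removal
    of strands for which no effective crossing took place.
    """
    assert len(parent_a) == len(parent_b) and len(parent_a) >= 2
    cut = rng.randrange(1, len(parent_a))  # randomly chosen cut site
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    if {child_a, child_b} == {parent_a, parent_b}:
        return None
    return child_a, child_b
```

A multi-point crossing would be obtained by calling the function repeatedly on
the resulting offspring.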




Fig. 3. Individual crossing. 



4.5 Mutation 

Once the individuals have been crossed, the mutation operation should be
applied. This mutation can be achieved by employing site-directed mutagenesis
[9]. Before the substitution operation it is necessary to apply the mutation
probability to each bit. To do so, if the generic bit i is to be mutated, the
following pieces will be generated in two different temporary tubes:
RE_i-0_i-RE_{i-1} and RE_i-1_i-RE_{i-1}. One of these short strands will be
taken randomly, so that, depending on the value of bit i in the individual
strand and on the value of the short strand chosen, the mutation will be made
or not. This mutation will be repeated for each bit.
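The per-bit mutation step can be sketched in software as follows. The sketch
assumes that each bit is first selected with a mutation probability p_mut and
is then overwritten with a randomly chosen value 0 or 1, so that the bit
actually changes only when the chosen value differs from the current one. The
names mutate and p_mut are hypothetical.

```python
import random

def mutate(bits, p_mut, rng=random):
    """Overwrite each bit, selected with probability p_mut, with a random 0/1."""
    out = []
    for b in bits:
        if rng.random() < p_mut:
            # Random short strand: the bit changes only if it differs from b.
            out.append(rng.choice("01"))
        else:
            out.append(b)
    return "".join(out)
```

Under these assumptions the effective probability of flipping a bit is p_mut/2.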

4.6 Introducing new individuals in the population

Recombined and mutated strands are evaluated and their fitness values are
inserted. Once the final chains to be introduced in the new population have
been obtained, it is necessary to extract individuals from the last generation.
When those individuals have been selected, the new ones are taken one by one
from the mating pool by their identification number, so that for each
individual we have to perform this operation: take the individual to be
extracted and the individual from the mating pool, introduce them in separate
temporary tubes and apply the restriction enzyme corresponding to RE_p. Strands
are separated by length, taking the longer one from the tube containing the
mating pool individual and the shorter one from the tube containing the
individual to be extracted. They are introduced in a temporary tube and DNA
ligase is applied (figure 4).

Fig. 4. Individual replacement.



4.7 Evolutionary computation

In this subsection, we propose an alternative evolutionary computation based on
restrictions. We have an initial test tube with individuals and a list of
restrictions that the individuals must satisfy to become a correct solution.
The fitness function of an individual is the degree of 'accepted' restrictions,
and we can evaluate these restrictions in parallel for the whole population. We
can cross over strands of individuals of a tube that agree with one restriction
with strands of individuals of another tube that satisfy two or three
restrictions. In this way, we mix Adleman's style of computation based on
restrictions [1] with evolutionary search techniques.

5 Conclusions

We have simulated with DNA molecules and recombinant DNA technology the
main operations of genetic algorithms. Until now molecular computation has
been used to solve NP-complete problems with exact and 'brute force'
algorithms. It is necessary for DNA computation to expand its algorithmic
techniques to incorporate approximate and probabilistic algorithms and
heuristics, so that the resolution of large instances of NP-complete problems
becomes possible. The simulation of concepts of genetic evolution with DNA that
we have presented will help DNA computation to solve more complex problems,
because it introduces genetic and evolutionary search into the list of
available molecular computation techniques.



Abstract. We explore the relationship between Marcus contextual languages
and CP-languages. We prove that external and internal contextual languages
without choice are 2CP- and 3CP-languages. We extend these results to
contextual languages with choice, by appropriately defining a concept of
selective CP-language.



1 Introduction 

The class of Marcus contextual languages, and especially their generative mech- 
anisms, the contextual grammars, are, since their introduction in 1969, both 
objects of intensive studies and helpful tools, in various areas of Linguistics, 
mathematical or not. The pioneering paper [3] introduces contextual grammars 
in their external variant, and [6] introduces the internal variant. The
mathematical study of contextual grammars developed quickly, and is now a very
large field which is, despite this, still rich in open and challenging
problems. We refer the reader to the two monographs [7] and, especially, [8].

Cut-and-Paste languages were recently introduced [1], and are still trying
to find their place. All the necessary definitions are given in this paper.
Their "generative device" is Kleene's fix-point theorem. Their name is a
natural consequence of the way "monomials" and "polynomials" are defined: they
"cut" a word into n pieces, then "glue" them with coefficients, and finally
"paste" the pieces. Cut and paste operations are central in DNA computing and
molecular computing, see for instance [9] for overwhelming proof. The kind of
cut-and-paste we describe by means of the concept of nCP-language is probably
the most rudimentary one. Still, there are some striking analogies with the
operations performed by contextual grammars, which motivate the beginning of a
comparative study, of which our present paper is nothing but a very small step.
It can also be seen as an attempt to generate contextual languages via a
fix-point mechanism, an attempt which has been successfully made by others too,
see [2].

We have considered here only the simplest types of contextual grammars,
external and internal, first without choice, then with a choice function of the
simplest type $\varphi : V^* \to \mathcal{P}(C)$. As described in Sections 3
and 4, the passage from contextual grammars to nCP-polynomials with n = 2 and 3
presents no problems. The reverse passage, when one takes into account the most
general form of an nCP-polynomial, does not "fall back precisely" on the same
spot, as illustrated at the end of Section 3. Languages generated by contextual
grammars with shuffled contexts, a concept introduced in [5], could be
considered as candidates for successfully completing this passage.

We introduce in Section 5 the concept of a $\varphi$-selective nCP-language
attached to a contextual grammar with selection function $\varphi$, and
afterwards the concept of selective nCP-language attached to an insertion
grammar. We think that the mechanism described there is crucial for
understanding the interplay between strings and contexts, an interplay embedded
in the choice operation, and not only there. The typology generated by this
interplay, as presented in [4], can be fully covered, and even enriched, by
considering a more general concept of selective nCP-language.

2 Definitions and notations 

Let V be an alphabet, $V^*$ the free monoid generated by V, and let
$1 \in V^*$ denote the empty word.

For a natural number $n \geq 2$, consider the product monoid

$$(V^*)^n = \underbrace{V^* \times V^* \times \dots \times V^*}_{n \text{ times}}$$

with concatenation defined componentwise and $(1, 1, \dots, 1)$ the neutral
element. Let $c_n : (V^*)^n \to V^*$ be the n-concatenation function, defined by

$$c_n(x_1, x_2, \dots, x_n) := x_1 x_2 \dots x_n \in V^*,$$

and consider its extension to the image function
$c_n : \mathcal{P}((V^*)^n) \to \mathcal{P}(V^*)$.

Let $c_n^{-1} : \mathcal{P}(V^*) \to \mathcal{P}((V^*)^n)$ be the pre-image
function, i.e. for $w \in V^*$

$$c_n^{-1}(w) = \{(x_1, x_2, \dots, x_n) \in (V^*)^n \mid x_1 x_2 \dots x_n = w\},$$

and for an $L \subseteq V^*$, let $c_n^{-1}(L) = \bigcup_{w \in L} c_n^{-1}(w)$.

$\mathcal{P}((V^*)^n)$ is a monoid with the usual product of sets. For fixed
$(a_1, \dots, a_n)$, $(b_1, \dots, b_n)$ in $(V^*)^n$, identified with their
corresponding singletons, we have the product

$$(a_1, a_2, \dots, a_n)\, c_n^{-1}(w)\, (b_1, b_2, \dots, b_n) =
\{(a_1 x_1 b_1, a_2 x_2 b_2, \dots, a_n x_n b_n) \mid x_1 x_2 \dots x_n = w\}.$$



Definition 1. We call first-degree n-cut-and-paste monomial (nCP-monomial
for short) an expression of the form

$$M(X) = c_n\big((a_1, a_2, \dots, a_n)\, c_n^{-1}(X)\, (b_1, b_2, \dots, b_n)\big).$$

We will call first-degree nCP-polynomial an expression

$$P(X) = \bigcup_{i=1}^{k} M_i(X) \cup L,$$

where $M_i(X)$, $i = \overline{1,k}$, are first-degree nCP-monomials, and
$L \subseteq V^*$ is finite and is called the free term of P.

The monomial function $M : \mathcal{P}(V^*) \to \mathcal{P}(V^*)$ is defined by
the composition of functions
$c_n \circ \big((a_1, a_2, \dots, a_n) \cdot \_ \cdot (b_1, b_2, \dots, b_n)\big) \circ c_n^{-1}$,
and all these functions commute with arbitrary unions. It follows that any
monomial function and also the polynomial functions commute with arbitrary
unions. In particular, they will be $\omega$-continuous functions from
$\mathcal{P}(V^*)$ to itself.

Applying Kleene’s fix-point theorem we obtain the following result: 

Theorem 1. Let P be a first-degree nCP-polynomial function. Then the fix-point
equation P(X) = X has a least solution, L, obtained as the limit of the
Kleene sequence:

$$L = \bigcup_{m \geq 0} P^m(\emptyset).$$

L is also the smallest $L' \in \mathcal{P}(V^*)$ such that $P(L') \subseteq L'$.



Definition 2. A language $L \subseteq V^*$ which is the solution of a fix-point
equation P(X) = X, with P a first-degree nCP-polynomial, will be called an
nCP-language.
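The Kleene sequence can be computed explicitly when it is truncated at a
maximal word length. The sketch below is an illustration under assumptions, not
part of the paper: words are Python strings, the empty word 1 is the empty
string, a monomial is given by its two coefficient tuples, and the names
splits, monomial and least_fixpoint are hypothetical. Since monomials never
shorten words, the truncated iteration yields exactly the words of the
nCP-language up to the chosen length.

```python
from itertools import combinations_with_replacement

def splits(word, n):
    """c_n^{-1}(word): all ways to cut word into n (possibly empty) pieces."""
    for cuts in combinations_with_replacement(range(len(word) + 1), n - 1):
        bounds = (0,) + cuts + (len(word),)
        yield tuple(word[bounds[i]:bounds[i + 1]] for i in range(n))

def monomial(a, b, X):
    """c_n((a_1..a_n) c_n^{-1}(X) (b_1..b_n)) for a set of words X."""
    n = len(a)
    return {"".join(a[i] + x[i] + b[i] for i in range(n))
            for w in X for x in splits(w, n)}

def least_fixpoint(monomials, free_term, max_len):
    """Kleene iteration of P(X) = union of monomials(X) and the free term,
    keeping only words of length at most max_len."""
    X = set()
    while True:
        nxt = {w for w in free_term if len(w) <= max_len}
        for a, b in monomials:
            nxt |= {w for w in monomial(a, b, X) if len(w) <= max_len}
        if nxt == X:
            return X
        X = nxt
```

For instance, least_fixpoint([(("a", ""), ("", "b"))], {"ab"}, 6) iterates the
2CP-polynomial with the single monomial of coefficients (a, 1) and (1, b) and
free term {ab}.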



3 External contextual languages without choice are 
2CP-languages 

A contextual grammar without choice is a triple G = (V, A, C), where V is a
finite alphabet, A is a finite language over V (the set of axioms) and C is a
finite subset of $V^* \times V^*$ (the contexts).

With respect to such a grammar, the external derivation relation on $V^*$ is
defined by:

$$x \Rightarrow_{ex} y \iff y = uxv, \text{ for a context } (u, v) \in C,$$

and by $\Rightarrow^{*}_{ex}$ we denote as usual the reflexive and transitive
closure of $\Rightarrow_{ex}$.

The external contextual language generated by G is:

$$L_{ex}(G) := \{y \in V^* \mid x \Rightarrow^{*}_{ex} y,\ x \in A\}.$$

It can also be defined as the smallest language $L \subseteq V^*$ such that:

1. $A \subseteq L$;

2. if $x \in L$ and $(u, v) \in C$, then $uxv \in L$.



Theorem 2. $L_{ex}(G)$ is a first-degree 2CP-language. More precisely,
$L_{ex}(G)$ is the least solution of the fix-point equation attached to the
first-degree 2CP-polynomial:

$$P_G(X) = \bigcup_{(u,v) \in C} c_2\big((u, 1)\, c_2^{-1}(X)\, (1, v)\big) \cup A.$$

The proof is straightforward, and is based on identifying each member $X_k$ of
the Kleene sequence of $P_G$ with the set of words obtained from axioms by
applying at most k external derivations (we start the Kleene sequence with
$X_{-1} = \emptyset$, $X_0 = A$, etc.).

This first result shows that we can make a "canonical" passage from a
contextual grammar G to a first-degree 2CP-polynomial, $P_G$, and that the
languages generated by the two distinct mechanisms coincide.
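The coincidence of the two mechanisms can be checked on small instances. The
following sketch, given under the assumption that words are Python strings and
that both constructions are truncated at a maximal length, computes the
external contextual language once by exhaustive derivation and once by Kleene
iteration of the polynomial attached to the grammar. The function names are
hypothetical.

```python
def external_closure(axioms, contexts, max_len):
    """Words of L_ex(G) of length <= max_len, by derivations x => uxv."""
    seen = {w for w in axioms if len(w) <= max_len}
    frontier = list(seen)
    while frontier:
        x = frontier.pop()
        for u, v in contexts:
            y = u + x + v
            if len(y) <= max_len and y not in seen:
                seen.add(y)
                frontier.append(y)
    return seen

def kleene_2cp(axioms, contexts, max_len):
    """Truncated Kleene iteration of P_G(X), the union over (u, v) in C of
    c_2((u,1) c_2^{-1}(X) (1,v)), together with the axioms A."""
    X = set()
    while True:
        nxt = {w for w in axioms if len(w) <= max_len}
        for w in X:
            for cut in range(len(w) + 1):    # c_2^{-1}(w): pairs (x1, x2)
                x1, x2 = w[:cut], w[cut:]
                for u, v in contexts:
                    y = (u + x1) + (x2 + v)  # c_2 applied to (u x1, x2 v)
                    if len(y) <= max_len:
                        nxt.add(y)
        if nxt == X:
            return X
        X = nxt
```

On the example A = {c}, C = {(a, b), (b, 1)} both functions return the same set
for any length bound, as the theorem predicts.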

The next result shows that we can also go ("canonically") in the reverse
direction, from 2CP-polynomials of a certain type, to contextual grammars.

Theorem 3. Let

$$P(X) = \bigcup_{i=1}^{n} c_2\big((a_i, 1)\, c_2^{-1}(X)\, (1, b_i)\big) \cup A$$

be a 2CP-polynomial, and let X be the least solution of the fix-point equation
attached to it.

With $C := \{(a_i, b_i) \mid i = \overline{1,n}\}$, $G_P = (V, A, C)$ is a
contextual grammar such that $X = L_{ex}(G_P)$.

The proof is similar to that of the preceding theorem. 

Note that, in order to fall precisely on external contextual languages, we had 
to confine ourselves to 2CP-polynomials of a very particular type. 

We are now going to consider 2CP-polynomials of the most general form, and
construct a more general type of contextual grammar.

Let

$$P(X) = \bigcup_{i=1}^{n} c_2\big((a_i, b_i)\, c_2^{-1}(X)\, (a'_i, b'_i)\big) \cup A$$

be a 2CP-polynomial, in its most general form. Consider the finite subsets of
$V^* \times V^*$

$$C_{pref} := \{(a_i, a'_i) \mid i = \overline{1,n}\},$$

$$C_{suf} := \{(b_i, b'_i) \mid i = \overline{1,n}\},$$

and call them, respectively, prefix-contexts and suffix-contexts. We associate
to P the quadruple:

$$G_P = (V, A, C_{pref}, C_{suf}).$$

For $x, y \in V^*$ define the (one-step) derivation in $G_P$ as follows:

$$x \Rightarrow_{G_P} y \iff \exists (x_1, x_2) \in V^* \times V^* \text{ such that } x_1 x_2 = x,$$
$$\exists (u, u') \in C_{pref},\ \exists (v, v') \in C_{suf} \text{ such that } y = u x_1 u' v x_2 v'.$$

Define the language generated by $G_P$ as

$$L(G_P) := \{y \mid x \Rightarrow^{*}_{G_P} y,\ x \in A\}.$$

It can alternatively be described as the smallest language $L \subseteq V^*$
such that:

1. $A \subseteq L$;

2. $x \in L$ with $x = x_1 x_2$ implies $u x_1 u' v x_2 v' \in L$ for any
$(u, u') \in C_{pref}$ and $(v, v') \in C_{suf}$.

Theorem 4. L{Gp) is precisely the least solution of the fix-point equation at- 
tached to P. 

As we can see, the grammar $G_P = (V, A, C_{pref}, C_{suf})$ attached to a
2CP-polynomial in its most general form is not a contextual grammar without
choice in the classical sense, but a much richer structure. The richness comes
not only from the "multiplication" of the set of contexts, but essentially from
the fact that the second component of a prefix-context, concatenated with the
first component of a suffix-context, appears as an internal context in a
derived word, thus blurring the distinction between external and internal
languages generated by such a grammar. Only if
$C_{pref} \subseteq V^* \times \{1\}$ and $C_{suf} \subseteq \{1\} \times V^*$
can we, by taking $C = \pi_1(C_{pref}) \times \pi_2(C_{suf})$, fall back again
on a contextual grammar without choice G = (V, A, C).

4 Internal contextual languages without choice are 
3CP-languages 

Let G = (V, A, C) be a contextual grammar without choice. Consider the
internal derivation relation defined by:

$$x \Rightarrow_{in} y \iff x = x_1 x_2 x_3,\ y = x_1 u x_2 v x_3,
\text{ for any } x_1, x_2, x_3 \in V^*,\ (u, v) \in C.$$

The internal contextual language generated by G is:

$$L_{in}(G) := \{y \in V^* \mid x \Rightarrow^{*}_{in} y,\ x \in A\}.$$

It can be written as the union $L_{in}(G) = \bigcup_{k \geq 0} L_{in}^{k}(G)$,
where $L_{in}^{k}(G)$ is the set of words obtained by applying at most k
derivations to the axioms in A.

Consider now the 3CP-polynomial:

$$P_G(X) = \bigcup_{(a,b) \in C} c_3\big((1, a, b)\, c_3^{-1}(X)\big) \cup A.$$

Theorem 5. $L_{in}(G)$ is a 3CP-language. More precisely, $L_{in}(G)$ is the
least solution of the fix-point equation attached to the above polynomial $P_G$.

The proof is again straightforward, and is based on identifying each
$L_{in}^{k}(G)$ with the k-th term of the Kleene sequence of $P_G$, starting
the sequence with $X_{-1} = \emptyset$, $X_0 = A$, etc.
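The internal case can be illustrated in the same way. The sketch below assumes
words are Python strings and truncates the Kleene iteration at a maximal
length; internal_step and internal_language are hypothetical names. One
application of internal_step corresponds to the monomials
$c_3((1, u, v)\, c_3^{-1}(\cdot))$ applied to a single word.

```python
def internal_step(word, contexts):
    """All y = x1 u x2 v x3 with word = x1 x2 x3 and (u, v) in C."""
    out = set()
    n = len(word)
    for i in range(n + 1):
        for j in range(i, n + 1):
            x1, x2, x3 = word[:i], word[i:j], word[j:]
            for u, v in contexts:
                out.add(x1 + u + x2 + v + x3)
    return out

def internal_language(axioms, contexts, max_len):
    """Truncated Kleene iteration for the 3CP-polynomial attached to G."""
    X = set()
    while True:
        nxt = {w for w in axioms if len(w) <= max_len}
        for w in X:
            nxt |= {y for y in internal_step(w, contexts) if len(y) <= max_len}
        if nxt == X:
            return X
        X = nxt
```

With the empty word as the only axiom and the single context (a, b), the
iteration produces the well-balanced words over a and b up to the chosen length.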

As in the previous section, we will now go from 3CP-polynomials towards
internal contextual languages, via grammars.

For 3CP-polynomials of a particular form, the passage is easy:

Theorem 6. Let
$P(X) = \bigcup_{i=1}^{n} c_3\big((1, a_i, b_i)\, c_3^{-1}(X)\big) \cup A$ be a
3CP-polynomial, and denote by X its least fix-point solution. With
$C := \{(a_i, b_i) \mid i = \overline{1,n}\}$, $G_P = (V, A, C)$ is a
contextual grammar such that $L_{in}(G_P) = X$.

5 Contextual languages with choice and selective 
nCP-Languages 

A contextual grammar with choice is a quadruple $G = (V, A, C, \varphi)$, where
(V, A, C) is a contextual grammar without choice as before, to which a choice
(or selection) function $\varphi : V^* \to \mathcal{P}(C)$ was added. With
respect to such a grammar the external and internal derivation relations on
$V^*$ are defined by:

$$x \Rightarrow_{ex} y \iff y = uxv, \text{ for a context } (u, v) \in \varphi(x),$$

$$x \Rightarrow_{in} y \iff x = x_1 x_2 x_3,\ y = x_1 u x_2 v x_3,
\text{ for any } x_1, x_2, x_3 \in V^* \text{ and } (u, v) \in \varphi(x_2).$$

Their respective reflexive and transitive closures define the external and the
internal contextual language with choice generated by G, by:

$$L_{\alpha}(G) := \{y \in V^* \mid x \Rightarrow^{*}_{\alpha} y,\ x \in A\},
\text{ with } \alpha \in \{ex, in\}.$$



We intend to prove that, as in the case without choice, contextual languages 
with choice can be obtained as nCP-languages, with n = 2 for external ones and 
n = 3 for internal ones. In order to achieve this goal, we will have to generalize the 
concept of nCP-polynomial, to that of selective nCP-polynomial function. 

For a fixed $(u, v) \in C$, consider the partial function
$M_{(u,v)} : V^* \to V^*$ with domain
$Dom(M_{(u,v)}) = \{x \in V^* \mid (u, v) \in \varphi(x)\}$, and defined for
$x \in Dom(M_{(u,v)})$ by
$M_{(u,v)}(x) = uxv = c_2\big((u, 1)\, c_2^{-1}(x)\, (1, v)\big)$. We recognize
in the last term of these equalities a 2CP-monomial function acting on a word
$x \in V^*$. We can extend $M_{(u,v)}$ to a total function,
$M_{(u,v)} : \mathcal{P}(V^*) \to \mathcal{P}(V^*)$, by defining it on
singletons as

$$M_{(u,v)}(\{x\}) = \begin{cases}
\{uxv\}, & \text{if } x \in Dom(M_{(u,v)}), \text{ i.e. iff } (u, v) \in \varphi(x), \\
\emptyset, & \text{otherwise,}
\end{cases}$$

and extending it naturally to a monotone function by making it commute with
arbitrary unions.

Definition 3. A function $M_{(u,v)} : \mathcal{P}(V^*) \to \mathcal{P}(V^*)$,
defined as above, will be called a $\varphi$-selective 2CP-monomial function. A
function $P : \mathcal{P}(V^*) \to \mathcal{P}(V^*)$ defined as

$$P(X) := \bigcup_{(u,v) \in C} M_{(u,v)}(X) \cup A$$

will be called a $\varphi$-selective 2CP-polynomial function. A
$\varphi$-selective 2CP-language will be a language $L \subseteq V^*$ which is
the least solution of the fix-point equation P(X) = X attached to a
$\varphi$-selective 2CP-polynomial function P.

We now have the following result.

Theorem 7. The external contextual language with choice $L_{ex}(G)$ is a
$\varphi$-selective 2CP-language. More precisely, given the external contextual
grammar G, we can canonically associate to it a $\varphi$-selective
2CP-polynomial function $P_G$ as above, such that $L_{ex}(G)$ is the least
solution of its attached fix-point equation.

We will now consider the case n = 3 in order to obtain a similar result for 
internal contextual languages with choice. 

Consider the partial function $N_{(u,v)} : V^* \to V^*$ obtained by the
following recipe: restrict the codomain of $c_3^{-1}$ to
$V^* \times Dom(M_{(u,v)}) \times V^*$ and it will become a partial function,
then extend it monotonically to $\mathcal{P}(V^*)$, and make it total by
letting singletons {x} go to $\emptyset$ for x's not in its domain. We obtain
$N_{(u,v)} : \mathcal{P}(V^*) \to \mathcal{P}(V^*)$, which is, with the
considerations just made, precisely

$$c_3\big((1, u, v)\, c_3^{-1}(\cdot)\big).$$

Definition 4. A function $N_{(u,v)} : \mathcal{P}(V^*) \to \mathcal{P}(V^*)$,
defined as above, will be called a $\varphi$-selective 3CP-monomial function. A
function $P : \mathcal{P}(V^*) \to \mathcal{P}(V^*)$ defined as

$$P(X) := \bigcup_{(u,v) \in C} N_{(u,v)}(X) \cup A$$

will be called a $\varphi$-selective 3CP-polynomial function. A
$\varphi$-selective 3CP-language will be a language $L \subseteq V^*$ which is
the least solution of the fix-point equation P(X) = X attached to a
$\varphi$-selective 3CP-polynomial function P.

Theorem 8. The internal contextual language with choice $L_{in}(G)$ is a
$\varphi$-selective 3CP-language. More precisely, given the internal contextual
grammar G, we can canonically associate to it a $\varphi$-selective
3CP-polynomial function $P_G$ as in Definition 4 above, such that $L_{in}(G)$
is the least solution of its attached fix-point equation.

After having exemplified the construction of $\varphi$-selective
nCP-polynomials for contextual languages with the simplest type of selection
function $\varphi : V^* \to \mathcal{P}(C)$, there is no problem in
generalizing it to $\varphi : (V^*)^3 \to \mathcal{P}(C)$, thus obtaining the
total contextual languages with choice as $\varphi$-selective 3CP-languages,
and, for $\varphi : (V^*)^n \to \mathcal{P}(C)$, the n-contextual languages as
$\varphi$-selective nCP-languages.

Consider now an insertion grammar G = (V, A, P), where V, A are as above, and
the finite set $P \subseteq (V^*)^3$ is called the set of productions of G. The
corresponding language L(G) is defined as usual, using as one-step derivation
the following relation:

$$w \Rightarrow y \iff w = x_1 u v x_2,\ y = x_1 u x v x_2,
\text{ for any } x_1, x_2 \in V^* \text{ and } (u, x, v) \in P.$$

For fixed $(u, x, v) \in P$ consider the pre-image of two-concatenation
$c_2^{-1}$ with restricted codomain $V^*u \times vV^*$, and apply the recipe:
make it total by letting singletons {w} go to $\emptyset$ for w not in its
domain, and then extend it $\omega$-continuously to get a monomial function
$Q_{(u,x,v)} : \mathcal{P}(V^*) \to \mathcal{P}(V^*)$, defined by
$Q_{(u,x,v)}(X) = c_2\big((1, x)\, c_2^{-1}(X)\big)$.

Theorem 9. The language generated by an insertion grammar G is a selective
2CP-language. More precisely, it is the least solution of the fix-point
equation attached to the selective 2CP-polynomial

$$P(X) = \bigcup_{(u,x,v) \in P} Q_{(u,x,v)}(X) \cup A.$$



6 Some concluding remarks 

The construction of different types of selective nCP-polynomials done in the 
last section illustrates the whole delicate interplay between strings and contexts. 
In particular, the selective 2CP-polynomials attached to an insertion grammar 
open the way towards an alternative formalism for the splicing operation, based 
on CP-polynomials. 

We hope that completing the research started here with the reverse passage,
from selective nCP-languages to contextual ones, will contribute to one of the
goals proposed by S. Marcus in [4], namely "to enrich the combinatorics of
distinctions left-right, internal-external ...", providing at the same time an
extremely powerful tool for their unified study.



Abstract. In this paper, we deal with a class of contextual languages
defined in the frame of order-sorted algebra. We define a new way to
describe contextual (multi)languages, using constraints. For this class
of constraint contextual languages a learning method with motivations
from natural language processing is provided.



1 Introduction 

It is a fact that precise grammars for natural language processing are very
hard to describe. At the same time, learning such grammars is usually a slow
process and the result is difficult to use in practice. A comfortable compromise
between these directions might be to write some general rules and then use a
learning method to acquire the remaining specific information.

Contextual grammars ([6], [5]) are a very interesting model for describing
non-context-free constructions in natural languages. Order-sorted algebra ([1])
is a powerful algebraic approach to computational semantics. At the
intersection of these areas, we have developed a generative formalism called
order-sorted multilanguages ([3]), where syntactical categories are described
as sorts and derivation rules as operations. We have also developed an extended
stack automaton called operatorial automaton to deal with order-sorted
multilanguages.

The link between order-sorted multilanguages and operatorial automata is
established by a set of operatorial relations which are a generalization of
operatorial precedence relations. A very interesting way to define constraint
order-sorted multilanguages can be described using operatorial relations:
define a "maximal" multilanguage and then constrain it by selecting a subset of
operatorial relations. Only those words which are consistent with the remaining
set of operatorial relations will be kept.

The learning algorithm developed in this paper for constraint contextual
multilanguages splits the writing/learning paradigm in the following way: the
"maximal" contextual multilanguage must be written and the actual subset of
operatorial relations constraining the language may be learned from examples.

We believe that derivation rules are not as hard to write as it is to describe
the precise application of these rules. Our learning algorithm covers two
aspects: how the derivation rules apply and to which arguments they may be
applied. Considering the case of natural languages, the first aspect concerns
the order of words in a sentence, while the second aspect includes a refinement
of sorts and operations.

2 Preliminaries

We denote by [n] the set of the first n natural numbers, not equal to 0. A
partially ordered set or poset is a set S together with a binary relation
$\leq$ on S that is reflexive, transitive and antisymmetric.

Let V be a set. We denote by $2^V$ the set of all finite subsets of V. We
denote by $V^+$ the set of strings $v_1 \dots v_n$, with $v_i \in V$, for any
$i \in [n]$, and we denote $V^* = V^+ \cup \{\lambda\}$, where $\lambda$ is the
empty string. Let $w \in V^*$. We denote by $|w|$ the length of w and by
$|w|_a$ the number of occurrences of the character a in w. We say that
$u = u_1 \dots u_n$ is a prefix of $w = w_1 \dots w_m$ iff $n \leq m$ and
$u_i = w_i$, for any $i \in [n]$. Also, $\lambda$ is a prefix of any string of
characters.

The shuffle operation between words, denoted $\sqcup\!\sqcup$, is defined
recursively by
$av \sqcup\!\sqcup bw = a(v \sqcup\!\sqcup bw) \cup b(av \sqcup\!\sqcup w)$ and
$w \sqcup\!\sqcup \lambda = \lambda \sqcup\!\sqcup w = w$, where
$v, w \in V^*$, $a, b \in V$.
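The recursive definition of the shuffle translates directly into code. The
following sketch assumes that words are Python strings and that the empty
string plays the role of the empty word; the function name shuffle is chosen
here for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def shuffle(x, y):
    """Set of all interleavings of the words x and y."""
    if not x:
        return frozenset({y})
    if not y:
        return frozenset({x})
    # av shuffled with bw = a(v shuffled with bw) united with b(av shuffled with w)
    return (frozenset(x[0] + w for w in shuffle(x[1:], y))
            | frozenset(y[0] + w for w in shuffle(x, y[1:])))
```

The result is a set of words, so repeated interleavings that spell the same
word are counted once.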

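Definition 1. A (simple) order-sorted signature is a triple $(S, \leq, \Sigma)$
such that $(S, \leq)$ is a non-empty poset, called the sort set, and $\Sigma$
is an $S^* \times S$-sorted family
$\{\Sigma_{w,s} \mid w \in S^*, s \in S\}$ such that
$\Sigma_{w_1,s_1} \cap \Sigma_{w_2,s_2} = \emptyset$, for any
$w_1, w_2 \in S^*$, $s_1, s_2 \in S$, with $(w_1, s_1) \neq (w_2, s_2)$.



Definition 2. Let $(S, \leq, \Sigma)$ be an order-sorted signature. An
order-sorted multialgebra over $\Sigma$ is a pair
$A = (\{A_s\}_{s \in S}, \{\sigma_A\}_{\sigma \in \Sigma})$, where
$\{A_s\}_{s \in S}$ is an S-sorted family of sets such that $s \leq s'$ implies
$A_s \subseteq A_{s'}$, and the operations are defined as:

i) if $\sigma \in \Sigma_{\lambda,s}$, then $\sigma_A \in 2^{A_s}$;

ii) if $\sigma \in \Sigma_{s_1 \dots s_n,s}$, then
$\sigma_A : A_{s_1} \times \dots \times A_{s_n} \to 2^{A_s}$.

We denote $|A| = \bigcup_{s \in S} A_s$.

Definition 3. Let A be an order-sorted multialgebra over $\Sigma$ and
$s \in S$ be a sort. An element $a \in A_s$ is called reachable iff:

- there is $\sigma \in \Sigma_{\lambda,s}$ such that $a \in \sigma_A$, or

- there are $\sigma \in \Sigma_{s_1 \dots s_n,s}$ and $a_i \in A_{s_i}$,
$i \in [n]$, reachable elements, such that
$a \in \sigma_A(a_1, \dots, a_n)$.

A is called reachable iff it has only reachable elements.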


3 Contextual multilanguages 

Let $\Sigma$ be an order-sorted signature and V a finite, non-empty set, called
alphabet.



Definition 4. A reachable order-sorted multialgebra A is an order-sorted
multilanguage over $\Sigma$ and V iff $|A| \subseteq V^+$.

Definition 5. An operation $\sigma_A : A_{s_1} \to 2^{A_{s_2}}$ is a contextual
rule (respectively a shuffle rule) iff there exists a word
$mask_A(\sigma) \in V^+$ such that
$\sigma_A(\alpha) \subseteq mask_A(\sigma) \sqcup\!\sqcup \alpha$ (respectively
$\sigma_A(\alpha) = mask_A(\sigma) \sqcup\!\sqcup \alpha$), for any
$\alpha \in A_{s_1}$.

Definition 6. An order-sorted multilanguage is a contextual multilanguage
(respectively a contextual language with shuffle) iff it contains only
constants and contextual (respectively shuffle) rules.

We call a derivation of sort S 2 of a in A a, construction o; iff a G 
and q; G (Ja, or a construction ft o; iff a G P C As^^ a = bi and 

a G 

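Example 1. We consider $(S, \Sigma)$ an order-sorted signature, with
$S = \{s\}$, $\Sigma = \{\sigma^1 : \to s,\ \sigma^2 : s \to s\}$ and
$V = \{a, b, c\}$ an alphabet. Then, we consider a contextual language with
shuffle $A = (A_s, \sigma^1_A, \sigma^2_A)$, where $A_s$ is the set of all
words $w \in V^*$ such that $|w|_a = |w|_b = |w|_c$ and, for any prefix x of w,
$|x|_a \geq |x|_b \geq |x|_c$, and $\sigma^1_A = \{abc\}$,
$\sigma^2_A(\alpha) = abc \sqcup\!\sqcup \alpha$, for any $\alpha \in A_s$.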
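Membership in the carrier set of Example 1 can be tested with a single pass
over the word. The sketch below assumes the non-strict reading of the prefix
condition (at every prefix, the number of a's is at least the number of b's,
which is at least the number of c's) and equal letter counts for the whole
word; the function name in_example_language is hypothetical.

```python
def in_example_language(w):
    """Check |w|_a = |w|_b = |w|_c and the prefix condition on a, b, c counts."""
    counts = {"a": 0, "b": 0, "c": 0}
    for ch in w:
        if ch not in counts:
            return False  # letter outside the alphabet {a, b, c}
        counts[ch] += 1
        # Prefix condition, checked after each letter read.
        if not (counts["a"] >= counts["b"] >= counts["c"]):
            return False
    return counts["a"] == counts["b"] == counts["c"]
```

For example, aabbcc and abcabc are accepted, while acb is rejected because its
prefix ac contains more c's than b's.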



4 Constraint contextual multilanguages 

We will examine in the sequel a way to generate multilanguages using contextual 
operatorial relations which are formally defined in [2] . 

Definition 7. Let A be a contextual multilanguage and $R_A$ the set of
operatorial relations generated by A. Let R be a subset of $R_A$. The
constraint contextual multilanguage A/R is a contextual multilanguage defined
by:

T ^ Aj R — \^A cx^R^^‘^{a) C R}, for any a G g/ 

(/?)={ : P a,REi^) ^ ^}> P ^ A/ r)si- 



Example 2. We consider the contextual language with shuffle A defined in
Example 1. We have $R_A$ = {(a, a, 2), (a, a, 3), (a, b, 1), (a, b, 3),
(a, c, 3), (b, a, 3), (b, b, 3), (b, c, 1), (b, c, 3), (c, a, 3), (c, b, 3),
(c, c, 3)}. If R = {(a, a, 3), (a, b, 1), (b, b, 3), (b, c, 1), (c, c, 3)}
$\subseteq R_A$, then
$A/R = ((A/R)_s, \sigma^1_{A/R}, \sigma^2_{A/R})$, with
$(A/R)_s = \{a^n b^n c^n \mid n \geq 1\}$ and $\sigma^1_{A/R} = \{abc\}$,
$\sigma^2_{A/R}(a^n b^n c^n) = \{a^n a\, b^n b\, c^n c\}$, for any $n \geq 1$.



5 A learning method

A problem when writing down a contextual multilanguage is to establish the 
precise definition of derivation rules. Even with constraint contextual multilan- 
guages we still have a problem: how many operatorial relations must be kept in 
order to obtain the desired constraint multilanguage. 

The learning method we introduce here uses an operatorial automaton which 
is formally described in [3]. The learning algorithm has the following steps: 

1. Define a "maximal" contextual multilanguage A.

2. Generate the corresponding set of operatorial relations $R_A$.

^ R^^'^ (a) is the set of operatorial relations corresponding to a with respect to the 
derivation d of sort S 2 in A. 







R. Gramatovici 



3. If $M = (V, S, R_A, \dots)$ is an operatorial automaton which computes the
(least) sort of any word in A, define a learning operatorial automaton
$M' = (V, S, R, R_A, \dots)$ and set $R = \emptyset$.

4. Run the automaton M' on a set of examples $E \subseteq |A|$ in the
following way:

- if an operatorial relation between two characters a and b is required, take
this relation from R;

- if R doesn't contain relations between a and b, or all the relations that R
contains between a and b fail, search for a relation between a and b in $R_A$;

- if such a relation exists, add this relation to R and continue the execution.

Example 3. With the above notations, let us consider A the contextual language
with shuffle from Example 1, with the set of operatorial relations $R_A$
defined in Example 2. We take the set of positive examples
$E = \{a^2 b^2 c^2\}$. The learning method has a degree of randomness,
depending on the choices we make when selecting an operatorial relation from
$R_A$. The learning operatorial automaton may run in the following way:

(a,a,3)∈R

(aabbcc$, ($)., _) ⊢ (abbcc$, ($)(a)., _) ⊢ (abbcc$, ($).(a), _)

(a,b,1)∈R   (b,b,3)∈R

⊢ (bbcc$, ($)(a)(a)., _) ⊢ (bcc$, ($)(a)(ab)., _) ⊢ (bcc$, ($)(a).(ab), _)

(b,c,1)∈R   (c,c,3)∈R

⊢ (cc$, ($)(ab)(ab)., _) ⊢ (c$, ($)(ab)(abc)., _) ⊢ (c$, ($)(ab).(abc), _)

⊢ ($, ($)(abc)(abc)., _) ⊢ ($, ($)(abc)., s) ⊢ ($, ($)., s)

and we obtain a set of operatorial relations R = {(a, a, 3), (a, b, 1),
(b, b, 3), (b, c, 1), (c, c, 3)}, hence the procedure finds exactly the last
constraint contextual multilanguage from Example 2.

Theorem 1. The learning method described above is an algorithm. It learns a
constraint contextual multilanguage, included in the initial "maximal"
contextual multilanguage and consistent with the given set of examples.

Proof. The learning procedure always terminates because the set of operatorial
relations $R_A$ (hence also R) is finite. □



1 Introduction

In this paper we consider the classification of languages generated by linear 
grammars with one nonterminal symbol depending on the minimal depth of 
decision trees for language word recognition. 

Assume that $L$ is a language over a finite alphabet, $n$ is a natural number and $L(n)$ is the set of all words from $L$ whose length is equal to $n$. We denote by $h_L(n)$ the minimal depth of a decision tree which recognizes words from $L(n)$ and uses only checks each of which determines the $i$-th letter of a word, $i \in \{1, \ldots, n\}$. If $L(n) = \emptyset$, then $h_L(n) = 0$. (Note: the problem of recognizing whether a word belongs to $L(n)$ is not explored here; any word must be identified under the assumption that it belongs to $L(n)$.)

Instead of the function $h_L$ we will consider the following function:

$$H_L(n) = \max\{h_L(m) : m \le n\}.$$
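
For very small languages the two quantities just defined can be computed by exhaustive search. The following Python sketch is an illustration only, not the method of the paper: the function names `min_depth` and `H` are hypothetical, a language is assumed to be given as a dictionary from lengths to finite sets of words, and the search is exponential in the worst case.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def _min_depth(words):
    # A single remaining candidate (or none) needs no further checks.
    if len(words) <= 1:
        return 0
    n = len(next(iter(words)))
    best = n  # checking every position always suffices
    for i in range(n):
        groups = {}
        for w in words:
            groups.setdefault(w[i], set()).add(w)
        if len(groups) == 1:
            continue  # this check tells us nothing about the word
        depth = 1 + max(_min_depth(frozenset(g)) for g in groups.values())
        best = min(best, depth)
    return best


def min_depth(words):
    """Minimal depth of a decision tree identifying a word of equal-length `words`."""
    return _min_depth(frozenset(words))


def H(words_by_length, n):
    """Maximum of min_depth over all lengths m <= n."""
    return max((min_depth(words_by_length.get(m, ())) for m in range(n + 1)), default=0)
```

Each check reads one letter position, and the depth of a tree is the number of checks on its longest branch, which is what the recursion minimises.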

In [2, 3, 4] it was shown that for an arbitrary regular language $L$ either $H_L(n) = O(1)$, or $H_L(n) = \Theta(\log n)$, or $H_L(n) = \Theta(n)$. This paper investigates the behavior of the function $H_L$ for an arbitrary language $L$ generated by a linear grammar with one nonterminal symbol. We show that either $H_L(n) = O(1)$, or $H_L(n) = \Theta(\log n)$, or $H_L(n) = \Theta(n)$. These results will further be used for the investigation of linear grammars with many nonterminal symbols and of context-free grammars with one nonterminal symbol.

The proofs use methods of test theory [1, 2] and rough set theory [5, 6]. For a set of words $L(n)$ we construct the corresponding decision table, and use lower and upper bounds on the minimal depth of a decision tree for this table.

One can interpret a word from $L(n)$ as a description of an image on a screen with $n$ cells: the $i$-th letter of the word defines the color of the $i$-th cell of the screen. In this case a decision tree which recognizes words from $L(n)$ may be interpreted as an algorithm for the recognition of images which are defined by words from $L(n)$.

This work was partially supported by the Russian Federal Program "Integration" (project 473 "Educational-Research Center 'Methods of Discrete Mathematics for New Information Technologies'").
2 Definitions

Let $\Sigma$ be a finite nonempty set (alphabet) and $\Sigma^*$ be the set of all finite words over the alphabet $\Sigma$, including the empty word $\lambda$.

The root of a word $\alpha \in \Sigma^*$ is the word $\beta \in \Sigma^*$ of minimal length such that for some natural number $t$ the equality $\alpha = \beta^t$ holds. The root of the word $\alpha$ will be denoted by $rt(\alpha)$.

Denote by $\pi_k(\alpha)$ the cyclic permutation of $k$ letters from the beginning to the end of $\alpha$. Words $\alpha$ and $\beta$ will be called similar if there exists $k$ such that $rt(\beta) = \pi_k(rt(\alpha))$. The minimal $k$ such that $rt(\beta) = \pi_k(rt(\alpha))$ will be denoted by $RT(\alpha, \beta)$. If the words $\alpha$ and $\beta$ are not similar then $RT(\alpha, \beta) = \infty$.

Denote by $l(\alpha)$ the length of the word $\alpha$.
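
The notions of root, cyclic permutation and similarity are easy to experiment with. The Python sketch below is an illustration under two assumptions that the text does not spell out: the shift count k is taken to start from 0, and the root is only computed for nonempty words. The names `root`, `shift` and `RT` are chosen here for convenience.

```python
import math


def root(word):
    """Shortest beta such that word == beta repeated t times (word nonempty)."""
    n = len(word)
    for d in range(1, n + 1):
        if n % d == 0 and word[:d] * (n // d) == word:
            return word[:d]
    raise ValueError("the empty word has no root")


def shift(word, k):
    """Cyclic permutation: move the first k letters of word to its end."""
    k %= len(word)
    return word[k:] + word[:k]


def RT(alpha, beta):
    """Minimal k with root(beta) == shift(root(alpha), k), or infinity."""
    ra, rb = root(alpha), root(beta)
    if len(ra) != len(rb):
        return math.inf  # roots of different length are never similar
    for k in range(len(ra)):
        if shift(ra, k) == rb:
            return k
    return math.inf
```

For instance, the words 1010 and 0101 have roots 10 and 01, and one cyclic shift turns the first root into the second.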

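Let $S$ be a symbol which does not belong to $\Sigma$. We will consider an arbitrary grammar $\Gamma$ of the following form:

$$S \to \alpha_1 S \beta_1, \ \ldots, \ S \to \alpha_p S \beta_p, \quad S \to \varepsilon_1, \ \ldots, \ S \to \varepsilon_q,$$

where $p \ge 1$, $q \ge 1$ and $\alpha_i, \beta_i, \varepsilon_j$ are words from $\Sigma^*$, $i = 1, \ldots, p$, $j = 1, \ldots, q$. Denote by $L(\Gamma)$ the language over the alphabet $\Sigma$ generated by the grammar $\Gamma$.

With the grammar $\Gamma$ we will associate the following two sets $R_1(\Gamma) = \{rt(\alpha_i) : i = 1, \ldots, p, \ \alpha_i \ne \lambda\}$ and $R_2(\Gamma) = \{rt(\beta_i) : i = 1, \ldots, p, \ \beta_i \ne \lambda\}$, and also the matrix $A(\Gamma)$ with $p$ rows and 2 columns. Elements of this matrix $a_{i1}, a_{i2}$, $i = 1, \ldots, p$, are determined as follows:

a) $a_{i1} = l(\alpha_i)/l(rt(\alpha_i))$ if $\alpha_i \ne \lambda$, and $a_{i1} = 0$ otherwise;

b) $a_{i2} = l(\beta_i)/l(rt(\beta_i))$ if $\beta_i \ne \lambda$, and $a_{i2} = 0$ otherwise.

Denote by $\mathrm{rank}\,A(\Gamma)$ the rank of the matrix $A(\Gamma)$.

For $j = 1, \ldots, q$ denote by $\Gamma_j$ the following grammar:

$$S \to \alpha_1 S \beta_1, \ \ldots, \ S \to \alpha_p S \beta_p, \quad S \to \varepsilon_j.$$

It is clear that

$$L(\Gamma) = \bigcup_{j=1}^{q} L(\Gamma_j).$$

One can show that

$$\max\{H_{L(\Gamma_j)}(n) : j = 1, \ldots, q\} \le H_{L(\Gamma)}(n) \le \sum_{j=1}^{q} \bigl(H_{L(\Gamma_j)}(n) + 1\bigr) - 1.$$

Therefore $H_{L(\Gamma)}(n) = \Theta(\max\{H_{L(\Gamma_j)}(n) : j = 1, \ldots, q\})$. Consequently we can consider only the case when $q = 1$.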
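3 Main Result

Theorem 1. Let $\Gamma$ be a linear grammar with one nonterminal symbol

$$S \to \alpha_1 S \beta_1, \ \ldots, \ S \to \alpha_p S \beta_p, \quad S \to \varepsilon.$$

Then the following statements hold:

1. If $|R_1(\Gamma)| \ge 2$ or $|R_2(\Gamma)| \ge 2$ then $H_{L(\Gamma)}(n) = \Theta(n)$;

2. If $|R_1(\Gamma)| \le 1$, $|R_2(\Gamma)| \le 1$ and $\mathrm{rank}\,A(\Gamma) \le 1$ then $H_{L(\Gamma)}(n) = 0$ for any $n$;

3. Let $R_1(\Gamma) = \{\alpha\}$, $R_2(\Gamma) = \{\beta\}$ and $\mathrm{rank}\,A(\Gamma) = 2$. Then

a) if $\alpha\varepsilon = \varepsilon\beta$ then $H_{L(\Gamma)}(n) = 0$ for any $n$;

b) if $\alpha\varepsilon \ne \varepsilon\beta$ and $RT(\alpha, \beta) = \infty$ or $RT(\alpha, \beta) \ne l(\varepsilon) \bmod l(\alpha)$ then $H_{L(\Gamma)}(n) = \Theta(\log n)$;

c) if $\alpha\varepsilon \ne \varepsilon\beta$ and $RT(\alpha, \beta) = l(\varepsilon) \bmod l(\alpha)$ then $H_{L(\Gamma)}(n) = \Theta(n)$.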
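
The case analysis of Theorem 1 can be turned into a small classifier. The Python sketch below is an illustration only, under the following assumptions: a grammar is passed as a list of pairs of words (one pair per rule with $S$ on the right-hand side) plus the single terminating word, the entries of the matrix are the exponents of the roots, and the shift count in $RT$ starts from 0. The names `classify`, `root` and `RT` are hypothetical helpers introduced here.

```python
import math


def root(word):
    # Shortest block whose repetition gives `word` (word must be nonempty).
    n = len(word)
    return next(word[:d] for d in range(1, n + 1)
                if n % d == 0 and word[:d] * (n // d) == word)


def RT(alpha, beta):
    # Minimal cyclic shift taking root(alpha) to root(beta), else infinity.
    ra, rb = root(alpha), root(beta)
    if len(ra) == len(rb):
        for k in range(len(ra)):
            if ra[k:] + ra[:k] == rb:
                return k
    return math.inf


def classify(rules, eps):
    """Return '0', 'log n' or 'n' for rules [(alpha_i, beta_i), ...] and word eps."""
    r1 = {root(a) for a, _ in rules if a}
    r2 = {root(b) for _, b in rules if b}
    if len(r1) >= 2 or len(r2) >= 2:
        return "n"                                   # statement 1
    rows = [(len(a) // len(root(a)) if a else 0,
             len(b) // len(root(b)) if b else 0) for a, b in rules]
    # A p x 2 matrix has rank 2 iff some 2 x 2 minor is nonzero.
    rank_is_two = any(x1 * y2 != x2 * y1
                      for x1, y1 in rows for x2, y2 in rows)
    if not rank_is_two:
        return "0"                                   # statement 2
    (alpha,), (beta,) = r1, r2                       # both are singletons here
    if alpha + eps == eps + beta:
        return "0"                                   # statement 3 a)
    if RT(alpha, beta) == len(eps) % len(alpha):
        return "n"                                   # statement 3 c)
    return "log n"                                   # statement 3 b)
```

Calling `classify([("10", ""), ("", "01")], "0")` corresponds to the grammar with rules $S \to 10S$, $S \to S01$, $S \to 0$.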

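4 Examples

Example 1. Consider the following grammar $\Gamma_1$:

$$S \to 1S0, \quad S \to 0S1, \quad S \to 0110.$$

One can show that $R_1(\Gamma_1) = R_2(\Gamma_1) = \{0, 1\}$ and $|R_1(\Gamma_1)| = 2$. Therefore $H_{L(\Gamma_1)}(n) = \Theta(n)$.

Example 2. Consider the following grammar $\Gamma_2$:

$$S \to 1010\,S\,01010101, \quad S \to 101010\,S\,010101010101, \quad S \to \lambda.$$

One can show that $R_1(\Gamma_2) = \{10\}$, $R_2(\Gamma_2) = \{01\}$, $|R_1(\Gamma_2)| = |R_2(\Gamma_2)| = 1$, $A(\Gamma_2) = \begin{pmatrix} 2 & 4 \\ 3 & 6 \end{pmatrix}$, $\mathrm{rank}\,A(\Gamma_2) = 1$. Therefore $H_{L(\Gamma_2)}(n) = 0$ for any $n$.

Example 3. Consider the following grammar $\Gamma_3$:

$$S \to 10S, \quad S \to S01, \quad S \to 1.$$

One can show that $R_1(\Gamma_3) = \{10\}$, $R_2(\Gamma_3) = \{01\}$, $|R_1(\Gamma_3)| = |R_2(\Gamma_3)| = 1$, $A(\Gamma_3) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $10 \cdot 1 = 1 \cdot 01$. Therefore $H_{L(\Gamma_3)}(n) = 0$ for any $n$.

Example 4. Consider the following grammar $\Gamma_4$:

$$S \to 1S, \quad S \to S0, \quad S \to \lambda.$$

One can show that $R_1(\Gamma_4) = \{1\}$, $R_2(\Gamma_4) = \{0\}$, $|R_1(\Gamma_4)| = |R_2(\Gamma_4)| = 1$, $A(\Gamma_4) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $1\lambda \ne \lambda 0$, $RT(1, 0) = \infty$. Therefore $H_{L(\Gamma_4)}(n) = \Theta(\log n)$.

Example 5. Consider the following grammar $\Gamma_5$:

$$S \to 10S, \quad S \to S01, \quad S \to \lambda.$$

One can show that $R_1(\Gamma_5) = \{10\}$, $R_2(\Gamma_5) = \{01\}$, $|R_1(\Gamma_5)| = |R_2(\Gamma_5)| = 1$, $A(\Gamma_5) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $10\lambda \ne \lambda 01$, $RT(10, 01) = 1 \ne l(\lambda) \bmod 2$. Hence $H_{L(\Gamma_5)}(n) = \Theta(\log n)$.

Example 6. Consider the following grammar $\Gamma_6$:

$$S \to 10S, \quad S \to S01, \quad S \to 0.$$

One can show that $R_1(\Gamma_6) = \{10\}$, $R_2(\Gamma_6) = \{01\}$, $|R_1(\Gamma_6)| = |R_2(\Gamma_6)| = 1$, $A(\Gamma_6) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$, $10 \cdot 0 \ne 0 \cdot 01$, $RT(10, 01) = 1 = l(0) \bmod 2$. Therefore $H_{L(\Gamma_6)}(n) = \Theta(n)$.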
Abstract. Incomplete information systems are approached
here by general methods of rough set theory (see Pawlak [6,7]). 
We define approximation spaces of incomplete information sys- 
tems and study definability and strong definability of sets of 
objects. 



1 Basic notions 

An information system is a quadruple $I = (O, \mathcal{A}, V, \rho)$ such that $O$, $\mathcal{A}$, $V$ are nonempty finite sets, and $\rho : O \times \mathcal{A} \to V$. Elements of $O$, $\mathcal{A}$ and $V$ are called objects, attributes and values, respectively, and $\rho(x, A)$ is the value of attribute $A$ for object $x$. For instance, let $O$ consist of some persons and $\mathcal{A}$ contain three attributes $A, B, C$, interpreted as Sex, Age and Nationality, respectively. Then, $\rho(x, A)$ is F or M, $\rho(x, B)$ is a nonnegative integer, and $\rho(x, C)$ is English or German or Polish etc. The set $V$ consists of all values of attributes appearing in the system.

The above notion of an information system can be used to express complete knowledge about some piece of reality. In this paper, we are concerned with incomplete information systems, corresponding to incomplete knowledge. Then, we admit that $\rho$ is a partial mapping from $O \times \mathcal{A}$ to $V$; $\rho(x, A)$ is undefined if one does not know the value of attribute $A$ for object $x$. For different models of incomplete information systems see Orlowska and Pawlak [5]. Hereafter, by an information system we mean an incomplete information system. We always assume the nonempty knowledge condition:

(NKC) $(\forall x \in O)(\exists A \in \mathcal{A})$ $\rho(x, A)$ is defined.
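
A partial mapping of this kind is naturally stored as a dictionary in which missing keys stand for unknown values. The Python sketch below is a minimal illustration; the sample data and the function names `satisfies_nkc` and `is_complete` are assumptions made here, not part of the source.

```python
def satisfies_nkc(objects, attributes, rho):
    """(NKC): every object has at least one attribute whose value is defined."""
    return all(any((x, a) in rho for a in attributes) for x in objects)


def is_complete(objects, attributes, rho):
    """A system is complete when rho is defined on every (object, attribute) pair."""
    return all((x, a) in rho for x in objects for a in attributes)


# Hypothetical sample data: two persons, three attributes, some values unknown.
objects = {"ann", "bob"}
attributes = {"Sex", "Age", "Nationality"}
rho = {
    ("ann", "Sex"): "F",
    ("ann", "Age"): 31,
    ("bob", "Nationality"): "Polish",
}
```

In this sample every person has at least one known value, so (NKC) holds, while the system is not complete because, for example, the age of the second person is unknown.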

An information system $I = (O, \mathcal{A}, V, \rho)$ is said to be complete, if $\rho$ is a total mapping.

To simplify the framework we often consider (incomplete) two-valued information systems in which $V$ consists of truth values 1 and 0 (truth and falsehood). Attributes are partial unary predicates on the set of objects: attribute $A$ holds for object $x$, if $\rho(x, A) = 1$, attribute $A$ does not hold for object $x$, if $\rho(x, A) = 0$, and $A$ is undetermined on $x$, otherwise. Every information system $I = (O, \mathcal{A}, V, \rho)$ can be represented by the two-valued information system $I' = (O', \mathcal{A}', V', \rho')$ such that $O' = O$, $\mathcal{A}' = \mathcal{A} \times V$, $V' = \{0, 1\}$, and $\rho'$ is defined as follows:

$\rho'(x, (A, v)) = 1$ if $\rho(x, A) = v$,

$\rho'(x, (A, v)) = 0$ if $\rho(x, A) \ne v$ and $\rho(x, A)$ is defined,

$\rho'(x, (A, v))$ is undefined if $\rho(x, A)$ is undefined.

The system $I'$ will be called the two-valued representation of $I$ and denoted by $T(I)$.
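
The passage to the two-valued representation is a mechanical translation. The Python sketch below illustrates it for a system stored as a dictionary of known values; the function name `two_valued` and this storage format are assumptions made for the example.

```python
def two_valued(objects, attributes, values, rho):
    """Build rho' on pairs (x, (A, v)) from a partial mapping rho on (x, A)."""
    rho2 = {}
    for x in objects:
        for a in attributes:
            if (x, a) not in rho:
                continue  # rho(x, A) undefined: rho'(x, (A, v)) stays undefined
            for v in values:
                rho2[(x, (a, v))] = 1 if rho[(x, a)] == v else 0
    return rho2
```

Each attribute of the original system is replaced by one yes/no attribute per possible value, and an unknown value leaves all of these undetermined.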

With any two-valued information system $I = (O, \mathcal{A}, \{0, 1\}, \rho)$ we associate a propositional language $L_I$ whose atomic formulas are symbols $A$, for all $A \in \mathcal{A}$, and complex formulas are built by means of logical connectives: $\neg$ and $\wedge$. Disjunction $\vee$, implication $\to$ and equivalence $\leftrightarrow$ are defined as in classical logic.

For any formula $\varphi$, we define two sets $P(\varphi), N(\varphi) \subseteq O$, called the positive extension and the negative extension, respectively, of formula $\varphi$ in system $I$:

(PN0) $P(A) = \{x \in O : \rho(x, A) = 1\}$, $N(A) = \{x \in O : \rho(x, A) = 0\}$,

(PN1) $P(\neg\varphi) = N(\varphi)$, $N(\neg\varphi) = P(\varphi)$,

(PN2) $P(\varphi \wedge \psi) = P(\varphi) \cap P(\psi)$, $N(\varphi \wedge \psi) = N(\varphi) \cup N(\psi)$.

Observe that $P(\varphi) \cap N(\varphi) = \emptyset$, for every formula $\varphi$, but $P(\varphi) \cup N(\varphi) = O$ need not hold.
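
The clauses (PN0)-(PN2) translate directly into a recursive computation. In the Python sketch below, which is only an illustration, a formula is encoded as an attribute name, a pair `("not", f)` or a triple `("and", f, g)`; this encoding and the function name `extensions` are assumptions made here.

```python
def extensions(formula, objects, rho):
    """Return the pair (P, N) of positive and negative extensions of `formula`."""
    if isinstance(formula, str):                      # (PN0): atomic formula
        positive = {x for x in objects if rho.get((x, formula)) == 1}
        negative = {x for x in objects if rho.get((x, formula)) == 0}
        return positive, negative
    if formula[0] == "not":                           # (PN1): swap P and N
        positive, negative = extensions(formula[1], objects, rho)
        return negative, positive
    if formula[0] == "and":                           # (PN2)
        p1, n1 = extensions(formula[1], objects, rho)
        p2, n2 = extensions(formula[2], objects, rho)
        return p1 & p2, n1 | n2
    raise ValueError("unknown connective: %r" % (formula[0],))
```

An object on which some relevant attribute is undefined may end up in neither set, which is why the union of the two extensions need not be the whole set of objects.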

We briefly recall some notions of partial logic (or: Kleene 3-valued logic with strong connectives [1,4,2]) which is a standard logical formalism for describing incomplete information. The truth values are 0 (falsehood), $u$ (truth value gap) and 1 (truth), and one stipulates the ordering $0 < u < 1$. An assignment is a mapping $\alpha$ from the set of atomic formulas to $\{0, u, 1\}$, and it is defined for all formulas, by setting:

($\alpha$1) $\alpha(\neg\varphi) = 1 - \alpha(\varphi)$, where $1 - u = u$,

($\alpha$2) $\alpha(\varphi \wedge \psi) = \min(\alpha(\varphi), \alpha(\psi))$.

Let $\Gamma$ be a set of formulas, and let $\varphi$ be a formula. Then, $\Gamma \vdash_3 \varphi$ holds, if, for every assignment $\alpha$, if $\alpha(\psi) = 1$, for all $\psi \in \Gamma$, then $\alpha(\varphi) = 1$. $\vdash_3$ is the consequence relation of partial logic. By $\vdash_2$ we denote the consequence relation of classical logic (one considers classical assignments only, i.e. mappings from the set of atomic formulas to $\{0, 1\}$).

Let $I = (O, \mathcal{A}, \{0, 1\}, \rho)$ be a two-valued information system. For any $x \in O$, we define the set:

$$D_I(x) = \{A : \rho(x, A) = 1\} \cup \{\neg A : \rho(x, A) = 0\},$$

called the description of $x$ in $I$. By $\delta_I(x)$ we denote the conjunction of all formulas from $D_I(x)$. Notice that $D_I(x) \ne \emptyset$, by (NKC).

Proposition 1. For all objects $x \in O$ and formulas $\varphi$ of $L_I$, there hold the following equivalences:

(D1) $x \in P(\varphi)$ if, and only if, $D_I(x) \vdash_3 \varphi$,

(D2) $x \in N(\varphi)$ if, and only if, $D_I(x) \vdash_3 \neg\varphi$.

(D1) and (D2) can be proved by induction on $\varphi$, using (PN0)-(PN2) and obvious properties of $\vdash_3$. For complete information systems $I$, $\vdash_3$ can be replaced by $\vdash_2$.
2 Approximation spaces

Given a complete information system $I = (O, \mathcal{A}, V, \rho)$, one defines the indiscernibility relation $\sim_I$ on the set $O$ in the following way: $x \sim_I y$ if, for all $A \in \mathcal{A}$, $\rho(x, A) = \rho(y, A)$. For $X \subseteq O$, $C_L X$ is the join of all equivalence classes of $\sim_I$ which are totally contained in $X$, and $C_U X$ is the join of all equivalence classes of $\sim_I$ which are not disjoint with $X$. $C_L X$ and $C_U X$ are called the lower approximation and the upper approximation, respectively, of the set $X$. A set $X$ is said to be definable in $I$, if $X = C_L X$ (equivalently: $X = C_U X$). These notions constitute a foundation of Pawlak's rough set theory [6,7,3].

One can prove that approximation operations $C_L$ and $C_U$ satisfy the following conditions, for all $X, Y \subseteq O$:

(C0) $-C_L X = C_U(-X)$, $-C_U X = C_L(-X)$,

(C1) $C_L X \subseteq X$, $X \subseteq C_U X$,

(C2) if $X \subseteq Y$, then $C_L X \subseteq C_L Y$ and $C_U X \subseteq C_U Y$,

(C3) $C_L C_U X = C_U X$, $C_U C_L X = C_L X$.

Notice that (C3) implies the idempotence condition:

(C4) $C_L C_L X = C_L X$, $C_U C_U X = C_U X$,

since $C_L C_L X = C_L C_U C_L X = C_U C_L X = C_L X$, and similarly for $C_U$.

In fact, $(O, C_L, C_U)$ is a 0-dimensional topological space, that is, a space based on a family of clopen subsets of $O$; $C_U$ is the closure operation, and $C_L$ is its dual, i.e. the interior operation, of this space.

Incomplete information systems give rise to more general approximation operations which need not fulfill (C0). In this section, we briefly outline basic properties of these generalized notions.

We define an approximation space as a triple $S = (O, C_L, C_U)$ such that $O$ is a set, and $C_L$, $C_U$ are mappings from $\mathrm{Pow}(O)$ to $\mathrm{Pow}(O)$, satisfying (C1), (C2) and (C3), for all $X, Y \subseteq O$. Since (C4) follows from (C3), (C4) must also hold in every approximation space.

A set $X \subseteq O$ is said to be definable in an approximation space $S$, if $X = C_L X$ (by (C3), $X = C_L X$ iff $X = C_U X$). $\mathrm{Def}(S)$ denotes the family of all sets definable in $S$.

One easily shows that $\mathrm{Def}(S)$ is closed under arbitrary joins and meets, hence it is a complete lattice of sets. Conversely, every complete lattice $L$ of subsets of a set $O$ equals $\mathrm{Def}(S)$, for the approximation space $S = (O, C_L, C_U)$ such that $C_L X$ is the largest set in $L$ contained in $X$, and $C_U X$ is the smallest set in $L$ containing $X$.

Another characterization of approximation spaces can be given in terms of preordered sets. Let $\mathcal{P} = (O, \le)$ be a preordered set, that means, $\le$ is a reflexive and transitive relation on $O$. A set $X \subseteq O$ is called a positive cone in $\mathcal{P}$, if, for all $x, y \in O$, if $x \in X$ and $x \le y$ then $y \in X$. $\mathrm{Con}(\mathcal{P})$ denotes the family of all positive cones in $\mathcal{P}$. Since $\mathrm{Con}(\mathcal{P})$ is a complete lattice of sets, then, by the preceding paragraph, $\mathrm{Con}(\mathcal{P}) = \mathrm{Def}(S)$, for some approximation space $S = (O, C_L, C_U)$. One proves that operations $C_L$, $C_U$ can be defined as follows:

$$C_L X = \{x \in X : (\forall y \in O)(x \le y \Rightarrow y \in X)\} \qquad (1)$$

$$C_U X = \{y \in O : (\exists x \in X)(x \le y)\}. \qquad (2)$$
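
Formulas (1) and (2) are straightforward to evaluate on a finite preordered set. The Python sketch below is an illustration only; it assumes the preorder is supplied as a Boolean function `leq(x, y)`, and the names `lower` and `upper` are chosen here.

```python
def lower(X, objects, leq):
    """Formula (1): points of X all of whose successors also lie in X."""
    return {x for x in X if all(y in X for y in objects if leq(x, y))}


def upper(X, objects, leq):
    """Formula (2): points lying above some point of X."""
    return {y for y in objects if any(leq(x, y) for x in X)}
```

With the usual order on a small set of integers, the lower approximation of a set is its largest upward-closed subset and the upper approximation is its upward closure, so conditions (C1) and (C3) can be checked directly on examples.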

Conversely, every approximation space $S = (O, C_L, C_U)$ determines a preordered set $\mathcal{P} = (O, \le)$ such that, for $x, y \in O$, $x \le y$ iff, for all $X \in \mathrm{Def}(S)$, $x \in X$ entails $y \in X$. It is easy to show $\mathrm{Def}(S) = \mathrm{Con}(\mathcal{P})$. The relation $\le$ is called the specialization preorder of the approximation space $S$.

Proposition 2. For every approximation space $S$, the following conditions are equivalent:

(i) $S$ fulfills (C0),

(ii) the specialization preorder of $S$ is symmetric.

Proof. Assume (i). Then, for all $X \subseteq O$, $X \in \mathrm{Def}(S)$ iff $-X \in \mathrm{Def}(S)$, which yields: $x \le y$ iff $y \le x$, for all $x, y \in O$. Assume (ii). Then, the relation $\le$ is an equivalence relation on $O$. By (1), (2), $C_L X$ is the join of all equivalence classes of $\le$ totally contained in $X$, and $C_U X$ is the join of all equivalence classes of $\le$ which are not disjoint with $X$, and these operations obviously fulfil (C0).

Given an information system $I = (O, \mathcal{A}, V, \rho)$, one defines the relation $\sqsubseteq_I$: for $x, y \in O$, $x \sqsubseteq_I y$ iff, for all $A \in \mathcal{A}$, if $\rho(x, A)$ is defined then $\rho(x, A) = \rho(y, A)$ (that means, object $x$ is not more specified than object $y$). A set $X \subseteq O$ is said to be definable in $I$, if $X$ is a positive cone in the preordered set $(O, \sqsubseteq_I)$. $\mathrm{Def}(I)$ denotes the family of all sets definable in $I$. Clearly, $\mathrm{Def}(I) = \mathrm{Def}(S)$, for the approximation space $S = (O, C_L, C_U)$ in which $C_L$, $C_U$ are defined by (1), (2) with $\le$ interpreted as $\sqsubseteq_I$. This approximation space is denoted by $S(I)$. One proves that $\sqsubseteq_I$ is precisely the specialization preorder of $S(I)$. For $x \in O$, we define $[x] = \{y \in O : x \sqsubseteq_I y\}$, i.e. $[x]$ is the principal positive cone determined by object $x$.

The label 'definable sets' for positive cones with respect to $\sqsubseteq_I$ will be better justified in section 3, where we show that these sets are precisely the extensions of formulas of $L_I$.

At the end of this section, we observe that every nonempty finite approximation space $S = (O, C_L, C_U)$ equals $S(I)$, for some two-valued information system $I$ satisfying (NKC). Define $I = (O, \mathcal{A}, \{0, 1\}, \rho)$ by setting $\mathcal{A} = \mathrm{Def}(S)$ and, for all $x \in O$, $A \in \mathcal{A}$:

$\rho(x, A) = 1$ iff $x \in A$,

$\rho(x, A)$ is undefined, otherwise.

This information system $I$ satisfies (NKC), since $O \in \mathrm{Def}(S)$ (use (C1)), and consequently, $\rho(x, O) = 1$, for all $x \in O$.

Proposition 3. $S = S(I)$.

Proof. We prove $\mathrm{Def}(S) = \mathrm{Def}(I)$ (that entails $S = S(I)$). First, we show $\subseteq$. Let $A \in \mathrm{Def}(S)$. Assume $x \in A$ and $x \sqsubseteq_I y$. Then, $\rho(x, A) = 1$, and consequently, $\rho(y, A) = 1$, which yields $y \in A$. We have shown that $A$ is a positive cone with respect to $\sqsubseteq_I$, hence $A \in \mathrm{Def}(I)$. Now, we show the converse inclusion. Let $A \in \mathrm{Def}(I)$. Then, $A$ is a positive cone with respect to $\sqsubseteq_I$. Clearly, $A$ is the join of all $[x]$, for $x \in A$. Since $\mathrm{Def}(S)$ is closed under arbitrary (also empty) joins, in order to prove $A \in \mathrm{Def}(S)$ it suffices to show $[x] \in \mathrm{Def}(S)$, for any $x \in O$. Fix $x \in O$. Let $A_1, \ldots, A_n$ be all $A \in \mathrm{Def}(S)$ such that $x \in A$. We have $[x] = A_1 \cap \ldots \cap A_n$, and consequently, $[x] \in \mathrm{Def}(S)$.

Actually, system $I$, defined above, is one-valued. One may extend it to a properly two-valued system, by setting: $\rho(x, A) = 0$, if $x \in C_L(-A)$, and proposition 3 for the extended system can be proved in a similar way.



3 Definability 

In this section, we characterize sets definable and strongly definable in information systems in terms of propositional definability. Main results are first proved for two-valued information systems and, then, generalized for arbitrary information systems.

Let $I = (O, \mathcal{A}, \{0, 1\}, \rho)$ be an information system. A set $X \subseteq O$ is said to be propositionally definable in $I$, if $X = P(\varphi)$, for some formula $\varphi$ of $L_I$. By (PN1), $X$ is propositionally definable in $I$ iff $X = N(\varphi)$, for some formula $\varphi$ of $L_I$.

Theorem 1. For any $X \subseteq O$, $X$ is propositionally definable in $I$ if, and only if, $X \in \mathrm{Def}(I)$.

Proof. By induction on $\varphi$, we prove $P(\varphi), N(\varphi) \in \mathrm{Def}(I)$. Since $P(A) = \{x : \rho(x, A) = 1\}$, then $P(A)$ is a positive cone with respect to $\sqsubseteq_I$, and similarly for $N(A)$. For $\varphi = \neg\psi$, we apply (PN1) and the induction hypothesis. For $\varphi = \psi \wedge \chi$, we apply (PN2), the induction hypothesis and the fact that $\mathrm{Def}(I)$ is closed under joins and meets.

Now, assume $X \in \mathrm{Def}(I)$. If $X = \emptyset$, then $X = P(A \wedge \neg A)$, for any $A \in \mathcal{A}$. So, assume $X \ne \emptyset$. Let $\varphi$ denote the disjunction of all $\delta_I(x)$, for $x \in X$. We have $P(\psi \vee \chi) = P(\psi) \cup P(\chi)$, by (PN1), (PN2), hence:

$$P(\varphi) = \bigcup_{x \in X} P(\delta_I(x)) = X,$$

since $P(\delta_I(x)) = [x]$ and $X \in \mathrm{Def}(I)$.

Theorem 1 generalizes for non-two-valued information systems: for any information system $I$, $\mathrm{Def}(I)$ is precisely the family of sets propositionally definable in $T(I)$. That follows from theorem 1 and the equality $\mathrm{Def}(I) = \mathrm{Def}(T(I))$.

Let $I = (O, \mathcal{A}, V, \rho)$, $I' = (O', \mathcal{A}', V', \rho')$ be information systems. System $I'$ is called an informational extension of system $I$ (write: $I \sqsubseteq I'$), if $O = O'$, $\mathcal{A} = \mathcal{A}'$, $V = V'$ and $\rho \subseteq \rho'$, that means, for all $x \in O$ and $A \in \mathcal{A}$, if $\rho(x, A)$ is defined, then $\rho'(x, A) = \rho(x, A)$. One easily shows: $I \sqsubseteq I'$ entails $P(\varphi) \subseteq P'(\varphi)$ and $N(\varphi) \subseteq N'(\varphi)$, for all formulas $\varphi$ of $L_I = L_{I'}$. By $\mathrm{Com}(I)$ we denote the set of all complete informational extensions of $I$.
A set $X \subseteq O$ is said to be strongly definable in $I$, if $X \in \mathrm{Def}(I')$, for all $I' \in \mathrm{Com}(I)$. Accordingly, the sets strongly definable in an information system are the sets definable in all complete informational extensions of this system. Evidently, the notion of a strongly definable set deserves close attention as another natural 'incomplete' variant of the notion of a definable set in complete information systems. Below we characterize this notion in terms of propositional definability.

A formula $\varphi$ of $L_I$ is said to be determined in $I$, if $P(\varphi) \cup N(\varphi) = O$. It is said to be positively determined in $I$, if $P(\varphi) = P'(\varphi)$, for all informational extensions $I'$ of $I$. Every formula determined in $I$ is also positively determined in $I$, but the converse does not hold (take $\varphi = A \wedge \neg A$; $P(\varphi) = \emptyset$ holds in all information systems, but $P(\varphi) \cup N(\varphi) = N(\varphi)$ does not contain all objects, if $A$ is not determined). We say that object $x$ is incompatible with object $y$ in an information system $I$, if there is $A \in \mathcal{A}$ such that $\rho(x, A) \ne \rho(y, A)$ (both defined).

Theorem 2. For any information system $I = (O, \mathcal{A}, \{0, 1\}, \rho)$, and all $X \subseteq O$, the following conditions are equivalent:

(i) $X$ is strongly definable in $I$,

(ii) for all $x \in X$, $y \notin X$, $x$ is incompatible with $y$ in $I$,

(iii) $X = P(\varphi)$, for some formula $\varphi$ determined in $I$,

(iv) $X = P(\varphi)$, for some formula $\varphi$ positively determined in $I$,

(v) there is a formula $\varphi$ such that $X = P'(\varphi)$, for all $I' \in \mathrm{Com}(I)$.

Proof. Implications (iii)$\Rightarrow$(iv), (iv)$\Rightarrow$(v) and (v)$\Rightarrow$(i) are obvious. We prove (i)$\Rightarrow$(ii). Assume (ii) does not hold. Then, there exist $x \in X$, $y \notin X$ such that, for all attributes $A$, if both $\rho(x, A)$ and $\rho(y, A)$ are defined, then $\rho(x, A) = \rho(y, A)$. Clearly, there exists $I' \in \mathrm{Com}(I)$ such that $\rho'(x, A) = \rho'(y, A)$, for all attributes $A$, and consequently $x \sim_{I'} y$. Therefore, $X$ is not a join of equivalence classes of $\sim_{I'}$, hence $X$ is not definable in $I'$, and (i) fails. We prove (ii)$\Rightarrow$(iii). Assume (ii). We consider two cases. (I) $X = \emptyset$. Then, $X = P(\varphi)$, where $\varphi$ is the conjunction of all $A$, $\neg A$, for $A \in \mathcal{A}$, and $\varphi$ is determined in $I$, since $N(\varphi) = O$, by (NKC). (II) $X \ne \emptyset$. Let $\varphi$ be the disjunction of all formulas $\delta_I(x)$, for $x \in X$. We show $X \subseteq P(\varphi)$. Let $x \in X$. Since $D_I(x) \vdash_3 \delta_I(x)$ and $\delta_I(x) \vdash_3 \varphi$, then $D_I(x) \vdash_3 \varphi$, hence $x \in P(\varphi)$, by proposition 1. We show $-X \subseteq N(\varphi)$. Let $y \notin X$. By (ii), $D_I(y) \vdash_3 \neg\delta_I(x)$, for all $x \in X$, hence by proposition 1, $y \in N(\delta_I(x))$, for all $x \in X$. Accordingly, $y \in N(\varphi)$ (use $N(\psi \vee \chi) = N(\psi) \cap N(\chi)$). Since $P(\varphi) \cap N(\varphi) = \emptyset$, it follows that $X = P(\varphi)$ and $-X = N(\varphi)$, and $\varphi$ is determined in $I$.
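
The equivalence of conditions (i) and (ii) of Theorem 2 can be checked by brute force on small two-valued systems. The Python sketch below is an illustration only: the system is assumed to be stored as a dictionary of known 0/1 values, the objects are assumed to be sortable, and all function names are hypothetical. The enumeration of complete extensions is exponential in the number of unknown values.

```python
from itertools import product


def incompatible(x, y, attributes, rho):
    """Some attribute is defined on both objects with different values."""
    return any((x, a) in rho and (y, a) in rho and rho[(x, a)] != rho[(y, a)]
               for a in attributes)


def strongly_definable(X, objects, attributes, rho):
    """Condition (ii): every member of X is incompatible with every non-member."""
    return all(incompatible(x, y, attributes, rho)
               for x in X for y in objects - X)


def complete_extensions(objects, attributes, rho):
    """Yield every total 0/1 mapping that extends the partial mapping rho."""
    gaps = [(x, a) for x in sorted(objects) for a in sorted(attributes)
            if (x, a) not in rho]
    for fill in product((0, 1), repeat=len(gaps)):
        ext = dict(rho)
        ext.update(zip(gaps, fill))
        yield ext


def strongly_definable_bruteforce(X, objects, attributes, rho):
    """Condition (i): X is a union of indiscernibility classes in every completion."""
    def definable(ext):
        return not any(all(ext[(x, a)] == ext[(y, a)] for a in attributes)
                       for x in X for y in objects - X)
    return all(definable(ext) for ext in complete_extensions(objects, attributes, rho))
```

The first function needs only one pass over pairs of objects, which is the practical content of the theorem: strong definability can be decided without constructing any complete extension.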

It is not that easy to generalize this theorem for non-two-valued information systems, since not every complete informational extension of $T(I)$ is the two-valued representation of a system from $\mathrm{Com}(I)$. Let $I = (O, \mathcal{A}, V, \rho)$ be an information system. One easily shows that a system $I' \in \mathrm{Com}(T(I))$ equals $T(J)$, for some $J \in \mathrm{Com}(I)$, if, and only if, for all $A \in \mathcal{A}$, the following formulas are valid in $I'$ (i.e. their positive extensions equal $O$):

$\neg((A, v_i) \wedge (A, v_j))$, for $i \ne j$,

$(A, v_1) \vee \ldots \vee (A, v_n)$,

where $v_1, \ldots, v_n$ are all values in $V$. The set of all formulas of the above form will be denoted by $\mathrm{com}(I)$.

This observation leads us to the following definitions. Let $\Gamma$ be a set of formulas of $L_I$, where $I$ is two-valued. By $\mathrm{Com}_\Gamma(I)$ we denote the set of all $I' \in \mathrm{Com}(I)$ such that every formula from $\Gamma$ is valid in $I'$. A set $X \subseteq O$, where $O$ is the set of objects of $I$, is said to be strongly definable in $I$ with regard to $\Gamma$, if $X$ is definable in all systems $I' \in \mathrm{Com}_\Gamma(I)$. For a formula $\varphi$ of $L_I$, by $P_\Gamma(\varphi)$ (resp. $N_\Gamma(\varphi)$) we denote the meet of all sets $P'(\varphi)$ (resp. $N'(\varphi)$), for $I' \in \mathrm{Com}_\Gamma(I)$.

Proposition 4. For all sets $\Gamma$, formulas $\varphi$ and objects $x$, there hold the following equivalences:

(SD1) $x \in P_\Gamma(\varphi)$ iff $\Gamma \cup D_I(x) \vdash_2 \varphi$,

(SD2) $x \in N_\Gamma(\varphi)$ iff $\Gamma \cup D_I(x) \vdash_2 \neg\varphi$.

A formula $\varphi$ is said to be determined in $I$ with regard to $\Gamma$, if $P_\Gamma(\varphi) \cup N_\Gamma(\varphi) = O$. Object $x$ is said to be incompatible with object $y$ in system $I$ with regard to $\Gamma$, if the set $\Gamma \cup D_I(x) \cup D_I(y)$ is unsatisfiable in the sense of classical logic.

Theorem 3. For all information systems $I = (O, \mathcal{A}, \{0, 1\}, \rho)$, all sets $X \subseteq O$ and all sets $\Gamma$ such that $\mathrm{Com}_\Gamma(I) \ne \emptyset$, the following conditions are equivalent:

(i) $X$ is strongly definable in $I$ with regard to $\Gamma$,

(ii) for all $x \in X$, $y \notin X$, $x$ is incompatible with $y$ in $I$ with regard to $\Gamma$,

(iii) $X = P_\Gamma(\varphi)$, for some formula $\varphi$ determined in $I$ with regard to $\Gamma$,

(iv) there is a formula $\varphi$ such that $X = P'(\varphi)$, for all $I' \in \mathrm{Com}_\Gamma(I)$.

We omit proofs of proposition 4 and theorem 3. As an easy corollary from theorem 3, we obtain: for all information systems $I = (O, \mathcal{A}, V, \rho)$ and sets $X \subseteq O$, $X$ is strongly definable in $I$ iff there is a formula $\varphi$ (of $L_{T(I)}$) determined in $T(I)$ with regard to $\mathrm{com}(I)$ such that $X = P_{\mathrm{com}(I)}(\varphi)$ iff there is a formula $\varphi$ (of $L_{T(I)}$) such that $X = P'(\varphi)$, for all systems $I' \in \mathrm{Com}_{\mathrm{com}(I)}(T(I))$. Notice that $I' \in \mathrm{Com}_{\mathrm{com}(I)}(T(I))$ iff there exists $J \in \mathrm{Com}(I)$ such that $I' = T(J)$.
Abstract. Probably the distinguishing concept in incomplete information analysis is that of "boundary": in fact a boundary is precisely the region that represents those doubts arising from our information gaps. In the paper it is shown that rough set analysis adequately and elegantly grasps this notion via the algebraic features provided by co-Heyting algebras.



1 Algebraic Views of Rough Set Systems 

Any Rough Set System, that is the family of all the rough sets induced by an Approximation Space over a set $U$ (see the definition below), can be made into several logic-algebraic structures. In [7], for instance, the attention was focused on semi-simple Nelson algebras, Heyting algebras, double Stone algebras, three-valued Łukasiewicz algebras and Chain Based Lattices. In the present paper, Rough Set Systems are analyzed from the point of view of co-Heyting algebras. This new chapter in the algebraic analysis of Rough Sets does not follow from esthetic or completeness issues, but is a pretty immediate consequence of interpreting the basic features of co-Heyting algebras (originally introduced by C. Rauszer [8] and investigated by W. Lawvere in the context of Continuum Physics) through the lenses of incomplete information analysis. Indeed in [3] and in [4], Lawvere pointed out the role that the co-intuitionistic negation "non" (dual to the intuitionistic negation "not") plays in grasping the geometrical notion of "boundary" as well as the physical concepts of "sub-body" and "essential core of a body", and we aim at providing an outline of how and to what extent they are mirrored by the basic features of incomplete information analysis.



1.1 Indiscernibility Spaces and Rough Sets 

Given a (finite) universe $U$, when we take into account an equivalence relation $E \subseteq U \times U$ we assume that any equivalence class collects together the elements of $U$ that are considered indiscernible from some point of view: in Rough Set Analysis, typically $a, b \in U$ are indiscernible if they share all the possible (observational) properties we are provided with by some Information System (see [6]). In this perspective, any equivalence class modulo $E$ will be called a basic category and the ordered pair $(U, E)$ will be called an indiscernibility space. Let $AS(U)$ be the atomic Boolean algebra having $U/E$ as set of atoms. Then $(U, AS(U))$ is a 0-dimensional topological space. This space is called an Approximation Space, but with the same term we will also refer to the topology $AS(U)$ itself. As a topology, $AS(U)$ will induce a closure operator $\mathcal{C}$ and an interior operator $\mathcal{I}$ from $\wp(U)$ to $AS(U)$.

In Rough Set Analysis, $\forall G \subseteq U$, $\mathcal{C}(G)$ is called the upper approximation of $G$ (with respect to $E$) and it is usually denoted by $(uE)(G)$, while $\mathcal{I}(G)$ is called the lower approximation of $G$ and it is denoted by $(lE)(G)$. If $(lE)(G) = \emptyset$ (if $(uE)(G) = U$) then $G$ is said to be internally undefinable (externally undefinable).

If two sets $G, G' \subseteq U$ are such that $(uE)(G) = (uE)(G')$ and $(lE)(G) = (lE)(G')$, then $G$ and $G'$ are said to be rough equal, $G \approx G'$. Thus rough equality is an equivalence relation on $\wp(U)$. Any equivalence class of subsets of $U$ modulo the rough equality relation is called a rough set.
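
Lower and upper approximations, and with them rough equality, can be computed directly from the partition of the universe into basic categories. The Python sketch below is an illustration only; it assumes the partition is given as a list of sets, and the names `approximations` and `rough_equal` are chosen here.

```python
def approximations(G, partition):
    """Return (lower, upper) approximations of G for a partition of the universe."""
    lower = set().union(*[c for c in partition if c <= G])   # classes inside G
    upper = set().union(*[c for c in partition if c & G])    # classes meeting G
    return lower, upper


def rough_equal(G1, G2, partition):
    """Two sets are rough equal when both of their approximations coincide."""
    return approximations(G1, partition) == approximations(G2, partition)
```

With the basic categories {1, 2}, {3} and {4, 5}, the sets {1, 3} and {2, 3} share the lower approximation {3} and the upper approximation {1, 2, 3}, so they belong to the same rough set.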

1.2 Basic algebraic concepts 

Let $\mathbf{L} = (L, \vee, \wedge, 0, 1)$ be a bounded distributive lattice.

Definition 1. The right adjoint to meet: given $a, b \in L$, set, $\forall x \in L$,

(i) $a \wedge x \le b$ if and only if $x \le a \supset b$; (ii) $\div a = a \supset 0$.

Then $a \supset b$ is called the pseudo-complement of $a$ relative to $b$ and by definition it is the largest element $x$ of $L$ such that $a \wedge x$ is less than or equal to $b$, while the element $\div a$ is called the pseudo-complement (or intuitionistic negation) of $a$ and it is the largest element $x$ of $L$ such that $a \wedge x = 0$.

Definition 2. Heyting algebras: if for any $a, b \in L$, the operation $a \supset b$ is defined, then $\mathbf{H} = (L, \vee, \wedge, \supset, \div, 0, 1)$ is called a Heyting algebra.

Definition 3. The left adjoint to join: given $a, b \in L$, set, $\forall x \in L$,

(i) $a \vee x \ge b$ if and only if $x \ge a \subset b$; (ii) $\neg a = a \subset 1$.

Then $a \subset b$ is called the dual pseudo-complement of $a$ relative to $b$, and it is the smallest element $x$ of $L$ such that $a \vee x$ is greater than or equal to $b$, while $\neg a$ is called the dual pseudo-complement (or co-intuitionistic negation) of $a$ and it is the smallest element $x$ of $L$ such that $a \vee x = 1$.

Definition 4. Co-Heyting algebras: if for any $a, b \in L$, $a \subset b$ is defined, then $\mathbf{CH} = (L, \vee, \wedge, \subset, \neg, 0, 1)$ is called a co-Heyting algebra.

Definition 5. A bi-Heyting algebra is a bounded distributive lattice that is both a Heyting and a co-Heyting algebra.
EXAMPLES

1. The system of all closed subsets of a topological space is a co-Heyting algebra. In fact in this case, given two closed subsets $X, Y$, $X \subset Y = \mathcal{C}(Y \cap -X)$. It follows that $\neg X = \mathcal{C}(-X) = -\mathcal{I}(X)$.

2. Any finite Heyting algebra is a bi-Heyting algebra.

We must now remark that being a bi-Heyting algebra is an interesting condition in se but not per se: indeed if we consider a finite Heyting algebra $\mathbf{H}$ such that the top element, 1, is co-prime (that is, for any $X \subseteq H$ such that $1 = \bigvee X$, $1 \in X$) then $\mathbf{H}$ is surely a bi-Heyting algebra, but for any $a \in H$, $\neg a = 1$ if $a \ne 1$, $\neg a = 0$ otherwise: the co-intuitionistic negation carries poor information in this case. Therefore, though one can say a priori that any Rough Set System is a bi-Heyting algebra (since it is a finite distributive lattice, thus a finite Heyting algebra), nevertheless this framework becomes interesting if we are able to identify the bi-Heyting algebra operations and to relate them to the features of Rough Set Analysis. So let us start with the operations that we can define on rough sets.

We have seen that any rough set $Z$ is an equivalence class of subsets of $U$ modulo the rough equality relation $\approx$. Thus $Z \in \wp(\wp(U))$. But since $Z$ is uniquely determined by the lower and the upper approximation of any of its elements, we can denote it by $\langle \mathcal{C}(Z_i), \mathcal{I}(Z_i) \rangle$ for $Z_i \in Z$. Therefore we introduce the following map:

Definition 6. $rs : \wp(U) \longmapsto AS(U) \times AS(U)$; $rs(X) = \langle \mathcal{C}(X), \mathcal{I}(X) \rangle$.

Then, we have: $[X]_\approx = rs^{-1}(rs(X))$, $\forall X \subseteq U$. If the pre-image of $rs$ is a singleton $\{X\}$, then we will denote it directly by $X$.

We denote the image of $\wp(U)$ along $rs$ by $RS(U)$ and we call it the Rough Set System induced by $AS(U)$. Clearly, if $\langle X, Y \rangle \in RS(U)$, then $X \supseteq Y$. Particularly, if $G \in AS(U)$ then $\mathcal{C}(G) = \mathcal{I}(G)$; hence any element of $rs(AS(U))$ has the form $\langle X, X \rangle$. We will call any rough set of this kind an exact rough set. Then, let us consider the following families, for any Boolean algebra $\mathbf{B}$:

$$\mathcal{P}(\mathbf{B}) = \{\langle a_1, a_2 \rangle \in B^2 : a_1 \ge a_2\}; \qquad \mathcal{B}(\mathbf{B}) = \{\langle a_1, a_2 \rangle \in B^2 : a_1 = a_2\}.$$

Clearly $rs(AS(U)) = \mathcal{B}(AS(U)) \subseteq RS(U) \subseteq \mathcal{P}(AS(U))$.

Shortly, we will identify $RS(U)$ precisely.

From now on by $a = \langle a_1, a_2 \rangle$, $x = \langle x_1, x_2 \rangle$ and so on, we will denote elements of $\mathcal{P}(\mathbf{B})$.

BASIC OPERATIONS ON $\mathcal{P}(\mathbf{B})$

1. $1 = \langle 1, 1 \rangle$;

2. $a \vee b = \langle a_1 \vee b_1, a_2 \vee b_2 \rangle$;

3. $a \to b = \langle a_2 \to b_1, a_2 \to b_2 \rangle$ (weak implication);

4. $\sim a = \langle \neg a_2, \neg a_1 \rangle$ (strong negation);

where 1, $\vee$, $\to$ and $\neg$ applied inside the ordered pairs are the operations of the underlying Boolean algebra $\mathbf{B}$.
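
The basic operations can be tried out on a concrete Boolean algebra. The Python sketch below is an illustration only: it assumes that the Boolean algebra is the powerset of a three-element set `U`, represents an element of the family of ordered pairs as a tuple of two frozensets whose first component contains the second, and uses the hypothetical names `comp`, `join`, `weak_imp` and `strong_neg`.

```python
U = frozenset({1, 2, 3})          # assumed universe; B is its powerset
TOP = (U, U)                      # the element <1, 1>


def comp(x):
    """Boolean complement inside the powerset of U."""
    return U - x


def join(a, b):
    """Operation 2: componentwise join."""
    return (a[0] | b[0], a[1] | b[1])


def weak_imp(a, b):
    """Operation 3: <a2 -> b1, a2 -> b2>, with x -> y computed as (-x) | y."""
    return (comp(a[1]) | b[0], comp(a[1]) | b[1])


def strong_neg(a):
    """Operation 4: <-a2, -a1>; it swaps and complements the two components."""
    return (comp(a[1]), comp(a[0]))
```

Strong negation is an involution on these pairs, and the weak implication of an element into the pair of empty sets yields a pair with two equal components, that is, an exact element.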
From the above basic set, we can define other derived operations. Some of these are the operations that make it possible to make $\mathcal{P}(\mathbf{B})$ into a Heyting algebra and a co-Heyting algebra.

DERIVED OPERATIONS ON $\mathcal{P}(\mathbf{B})$

5. $0 := \sim 1 = \langle 0, 0 \rangle$;

6. $a \wedge b := \sim(\sim a \vee \sim b) = \langle a_1 \wedge b_1, a_2 \wedge b_2 \rangle$;

7. $a \supset b := \sim\neg\sim a \vee b \vee (\neg a \wedge \neg\sim b)$ (intuitionistic implication);

8. $\div a := \sim\neg\sim a$ (intuitionistic negation);

9. $a \subset b := \sim(\sim a \supset \sim b) = \neg a \wedge b \wedge (\sim\neg\sim a \vee \sim\neg b)$;

10. $\neg a := a \to 0 = \langle \neg a_2, \neg a_2 \rangle$ (weak negation),

where 0, $\wedge$ and $\neg$ applied inside the ordered pairs are the operations of $\mathbf{B}$.

Lemma 7. For any $a \in \mathcal{P}(\mathbf{B})$, $\div a = a \supset 0$.

Proposition 8 . facts about negations; For any a e V{B) 

1. a) a = a; b) c) -i- -r Fa = Fa. 

2. Si) FFa = —^Fa Fa = ^ a; b) = F—^a = F ^ a ~^a; 

3. If a =< x^x >, then ^ a = ^a = Fa. 

Proposition 9. For any Boolean algebra $B$,

1. $N(B) = (P(B), \wedge, \vee, \to, \sim, \neg, 0, 1)$ is a semi-simple Nelson algebra.

2. $H(B) = (P(B), \wedge, \vee, \supset, \div, 0, 1)$ is a Heyting algebra.

(The proofs for the above Lemma, Facts and Propositions are in [7]).

Proposition 10. For any Boolean algebra $B$, $CH(B) = (P(B), \wedge, \vee, \subset, \neg, 0, 1)$
is a co-Heyting algebra.

Proof. For any $a, b \in P(B)$, $a \leq b$ iff $\sim b \leq \sim a$, and $\sim(a \wedge b) = \sim a \vee \sim b$: these
properties of $\sim$ are inherited from the same properties of the negation of $B$.
Thus, since from Proposition 9 $a \supset b = \bigvee\{z \in P(B) : z \wedge a \leq b\}$, it follows
that $\sim(\sim a \supset \sim b) = \sim\bigvee\{z \in P(B) : z \wedge \sim a \leq \sim b\} = \sim\bigvee\{z \in P(B) :
b \leq \sim(z \wedge \sim a)\} = \sim\bigvee\{z \in P(B) : b \leq \sim z \vee a\} = \bigwedge\{\sim z \in P(B) : b \leq \sim z \vee a\}$.

Since $\sim$ is an involutive anti-isomorphism we obtain that $\subset$ is a dual relative
pseudocomplementation.

Moreover, $a \subset 1 = \neg a \wedge 1 \wedge (\sim\neg\sim a \vee \sim\neg 1) = \neg a \wedge (\sim\neg\sim a \vee 1) = \neg a$. QED

Corollary 11. For any Boolean algebra $B$, $BH(B) = (P(B), \wedge, \vee, \subset, \supset, \div, \neg, 0, 1)$
is a bi-Heyting algebra.

2 Rough Sets and bi-intuitionistic Operations

Proposition 12. $\forall X \subseteq U$,

1) $\sim(rs(X)) = rs(-X)$; 2) $\div(rs(X)) = rs(-C(X))$; 3) $\neg(rs(X)) = rs(-I(X))$.

Proof: 1) $\sim(rs(X)) = \sim\langle C(X), I(X) \rangle = \langle -I(X), -C(X) \rangle
= \langle C(-X), I(-X) \rangle = rs(-X)$;

2) $\div(rs(X)) = \div\langle C(X), I(X) \rangle = \langle -C(X), -C(X) \rangle = rs(-C(X))$;

3) $\neg(rs(X)) = \neg\langle C(X), I(X) \rangle = \langle -I(X), -I(X) \rangle = rs(-I(X))$. qed

Combining 1 with 2 and 1 with 3 we obtain:

Corollary 13. For any Rough Set System $RS(U)$, for any $X \subseteq U$,

1) $\neg\sim(rs(X)) = rs(C(X))$; 2) $\div\sim(rs(X)) = rs(I(X))$.

3) $\div\sim(RS(U)) = \neg\sim(RS(U)) = rs(AS(U)) = \div(RS(U)) = \neg(RS(U))$.

Proof: 1) $\neg\sim(rs(X)) = rs(--C(X))$; 2) $\div\sim(rs(X)) = rs(--I(X))$;

3) Since both $C$ and $I$ map $\wp(U)$ onto $AS(U)$. qed
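Proposition 12 and Corollary 13 can be verified by brute force on a small approximation space. The sketch below assumes a hypothetical five-element universe with a fixed partition standing in for $U/E$; `upper` and `lower` play the roles of $C$ and $I$, and the three negations act on pairs exactly as in the proofs above.

```python
from itertools import combinations

# Hypothetical approximation space: U with the partition U/E below.
U = frozenset(range(5))
PARTITION = [frozenset({0}), frozenset({1, 2}), frozenset({3, 4})]


def subsets(s):
    """Return all subsets of s as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def upper(X):
    """C(X): union of the classes that meet X."""
    return frozenset().union(*[b for b in PARTITION if b & X])


def lower(X):
    """I(X): union of the classes contained in X."""
    return frozenset().union(*[b for b in PARTITION if b <= X])


def rs(X):
    """The rough set <C(X), I(X)> of X."""
    return (upper(X), lower(X))


def strong_neg(a):
    return (U - a[1], U - a[0])


def int_neg(a):
    return (U - a[0], U - a[0])


def weak_neg(a):
    return (U - a[1], U - a[1])
```

Looping over `subsets(U)` and comparing, for instance, `weak_neg(strong_neg(rs(X)))` with `rs(upper(X))` reproduces item 1 of Corollary 13 on this example.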

It follows that -1 ^ (that is and ^ -< (that is can be considered as 

modal operators that parallel with {uE) and, respectively, {IE), 

3 Bodies and Boundaries via Co-Heyting Algebras 

Given an element $a$ of a co-Heyting algebra CH, Lawvere calls $\neg\neg a$ the regular
core of $a$. Generally $\neg\neg a \leq a$. In order to appreciate this term, consider the
above results: if $a = rs(X)$ for $X \subseteq U$, then $rs^{-1}(\neg\neg a) = (lE)(X)$, that is
the necessary part of $X$ (in a literal sense when $AS(U)$ is interpreted as an S5
modal space; see for instance [5]). Moreover in [3], it is claimed that a part $a$ may
be considered a sub-body (or shortly a body) if and only if $\neg\neg a = a$. Thus the
notion of "sub-body" coincides in Rough Set Systems with that of exact rough
set, that is, rough sets that are "(deductively) closed" and "perfect"; and not by chance
we are following the terminology used by Leibniz for describing the notion of
"individual substance" (Discourse on Metaphysics).

But everything is centered on the fact that in co-Heyting algebras we can recap- 
ture the geometrical notion of "boundary”. Indeed Lawvere points out that this 
notion is definable by means of the co-intuitionistic negation, in the following 
manner (for a belonging to any co-Heyting algebra): 

$\partial(a) = a \wedge \neg a$.

First of all, $\partial(a)$ is the boundary of $a$ in a topological sense: if the given
co-Heyting algebra is that of Example 1, then $a$ is a closed set of some topological
space. Thus $\partial(a) = C(a) \cap -I(a)$, that is exactly the topological boundary,
$\mathcal{F}(a)$, of $a$.

More generally, $\partial(a)$ is a boundary since for any $a$, $b$ of any co-Heyting algebra,
it formally fulfills the rules:

1) $\partial(a \wedge b) = (\partial(a) \wedge b) \vee (a \wedge \partial(b))$; 2) $\partial(a \wedge b) \vee \partial(a \vee b) = \partial(a) \vee \partial(b)$.

The first formula is called "Leibniz formula" by W. Lawvere, who underlines
that though its validity for boundaries of closed sets is supported by our space
intuition (think of two partially overlapping ovals), nevertheless it is virtually
unknown in general topology literature.

Indeed we can notice that it is essentially the usual Leibniz rule for differentiation
of a product (but see also the Grassmann rule).

Moreover, Lawvere notices that any element $a$ in a co-Heyting algebra is the join
of its core and its boundary: $a = \neg\neg a \vee \partial(a)$.
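The two boundary rules and the core-plus-boundary decomposition can be tested on the algebra of ordered pairs over a small powerset. This is a minimal sketch under the assumption that the co-Heyting negation is the weak negation $\langle \neg a_2, \neg a_2 \rangle$ used earlier; the universe and the function names are illustrative.

```python
from itertools import combinations

# Hypothetical universe; B is its powerset.
U = frozenset({0, 1, 2})


def subsets(s):
    """Return all subsets of s as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


# Ordered pairs <a1, a2> with a2 a subset of a1.
P = [(a1, a2) for a1 in subsets(U) for a2 in subsets(a1)]


def join(a, b):
    return (a[0] | b[0], a[1] | b[1])


def meet(a, b):
    return (a[0] & b[0], a[1] & b[1])


def weak_neg(a):
    """Co-intuitionistic (weak) negation: <not a2, not a2>."""
    return (U - a[1], U - a[1])


def boundary(a):
    """Boundary of a: the meet of a with its weak negation."""
    return meet(a, weak_neg(a))


def core(a):
    """Regular core of a: double weak negation."""
    return weak_neg(weak_neg(a))
```

For every pair `a` the second component of `boundary(a)` is empty, which matches the later remark that the interior of a boundary is always empty.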

All these relations apply to any rough set. In particular, if $a = rs(X)$, then $\partial(a) =
a \wedge \neg a = \langle a_1 \cap -a_2, a_2 \cap -a_2 \rangle = \langle C(X) \cap -I(X), \emptyset \rangle = \langle \mathcal{F}(X), \emptyset \rangle$ (since the
interior of $\partial(a)$ is always empty, one can deduce the two-dimensionality of the
rough sets with non empty interior, that is rough sets corresponding to internally
definable subsets of the Approximation Space). Moreover, the boundary of $X$ in
the topological space $(U, AS(U))$ is given by: $\mathcal{F}(X) = rs^{-1}(\neg\sim(\partial(a)))$. In fact,
if $Z \in rs^{-1}(\partial(a))$, then $C(Z) = \mathcal{F}(X)$ and the preceding equation follows from
Corollary 13.

Now let us notice that no boundary can contain isolated points of the given
topology. This means that if a basic category $A \in U/E$ is a singleton, say $\{y\}$,
then for no $X \subseteq U$, $A \subseteq \mathcal{F}(X)$. In fact, either $y \in X$ or not. But this means
$A \subseteq I(X)$ or $A \subseteq -C(X)$: tertium non datur, whereas for any non-singleton basic
category $M$, say $\{y, z\}$, a third possibility is allowed. In fact, we can find at least
two sets $Y$, $Z$ such that $y \in Y$, $z \notin Y$ and $z \in Z$, $y \notin Z$ (for instance we can
trivially consider $Y = \{y\}$ and $Z = \{z\}$). In this case neither $M \subseteq I(Y)$ nor
$M \subseteq -C(Y)$ (the same for $Z$).

But in Rough Set Analysis $x$ is an isolated point if and only if we have complete
information about it: the basic category including $x$ reduces to a singleton and
this means that $x$ can be discerned from any other object. Let us denote by $B$
the union of all the isolated points of $(U, AS(U))$, and let us set $P = U \cap -B$.
Since $a \vee \div a = \sim\partial(a)$ and $\sim\langle P, \emptyset \rangle = \langle U, B \rangle$, we have just established that:

Proposition 14. For any Approximation Space $AS(U)$, for any $a \in P(AS(U))$,
the following are equivalent:

1. $\exists X \subseteq U$ such that $a = rs(X)$;

2. $a_1 \cap B = B \cap a_2$;

3. $a \vee \div a \geq \langle U, B \rangle$;

4. $\partial(a) \leq \langle P, \emptyset \rangle$.
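The four conditions of Proposition 14 can be compared exhaustively on a small example. The sketch below assumes a hypothetical partition with one singleton class (so that $B$ is non-empty); `AS` lists the definable sets, `PAIRS` plays the role of $P(AS(U))$, and `ROUGH` is the image of `rs`. All names are illustrative.

```python
from itertools import combinations

# Hypothetical approximation space; {0} is the only singleton class.
U = frozenset(range(5))
PARTITION = [frozenset({0}), frozenset({1, 2}), frozenset({3, 4})]


def subsets(s):
    """Return all subsets of s as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def upper(X):
    return frozenset().union(*[b for b in PARTITION if b & X])


def lower(X):
    return frozenset().union(*[b for b in PARTITION if b <= X])


def rs(X):
    return (upper(X), lower(X))


# Definable sets: unions of classes of the partition.
AS = [frozenset().union(*c) for r in range(len(PARTITION) + 1)
      for c in combinations(PARTITION, r)]
PAIRS = [(a1, a2) for a1 in AS for a2 in AS if a2 <= a1]

B = frozenset().union(*[b for b in PARTITION if len(b) == 1])  # isolated points
P = U - B
ROUGH = {rs(X) for X in subsets(U)}


def cond1(a):
    """a is the rough set of some subset X of U."""
    return a in ROUGH


def cond2(a):
    return a[0] & B == a[1] & B


def cond3(a):
    """a joined with its intuitionistic negation lies above <U, B>."""
    j = (a[0] | (U - a[0]), a[1] | (U - a[0]))
    return U <= j[0] and B <= j[1]


def cond4(a):
    """The boundary of a lies below <P, empty set>."""
    d = (a[0] & (U - a[1]), a[1] & (U - a[1]))
    return d[0] <= P and not d[1]
```

On this example 18 of the 27 pairs in `PAIRS` are rough sets, and the four predicates agree on every pair.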

It is worth noticing that condition 14.3 claims that the tertium non datur $a \vee \div a$
must hold with respect to the local top $\langle U, B \rangle$. Dually, 14.4 says that the
contradiction $a \wedge \neg a$ must be invalid with respect to the local bottom $\langle P, \emptyset \rangle$.
Now the algebraic task is to find an appropriate filter in order to distinguish 
RS{U) within P(AS(P)). In and [7] a suitable filter for condition 14.2 has been 
introduced. Now we show that it makes the other new constraints work as well. 
Indeed in the quoted papers, it is proved that RS{U) can be recovered from 
P(AS(P)) via the filter generated by < P, P >. Shortly, let us set for a^b ^ 
P(AS(P)): Y) a = b \S3x >< U, P > s. t. a A x = x A b; h) J{a) = VN=' 

Then we have: 

Lemma 15. For any $a \in P(AS(U))$, if $a = J(a)$, then $a_1 \cap B = B \cap a_2$.
Corollary 16. J<^’^>(P(AS(P))) = RS{U). 

Now we will prove: 
Proposition 17. For any $a \in P(AS(U))$, if $a = J(a)$ then
1) $a \vee \div a \geq \langle U, B \rangle$; 2) $a \wedge \neg a \leq \langle P, \emptyset \rangle$.

Proof. $a \vee \div a = \langle a_1, a_2 \rangle \vee \langle \neg a_1, \neg a_1 \rangle = \langle U, a_2 \cup \neg a_1 \rangle$. But since $a$ is a fixed
point of the operator $J$, in view of the previous Lemma we have $a_2 = a_2 \cup (B \cap a_1)$.
Hence $a_2 \geq (B \cap a_1)$. It follows that $a_2 \cup \neg a_1 \geq (a_1 \cap B) \cup \neg a_1 = B \cup \neg a_1 \geq B$.

The proof of the second statement is obtained by duality. QED

4 Bi-Heyting Algebras, Modalities and Rough Sets 

We have seen that the two double negation sequences $\neg\sim$ and $\sim\neg$ can be
considered a couple of modality operators. Indeed, according to the physical
interpretation, in view of Corollary 13(3), given any $a \in RS(U)$, $\neg\sim a$ sends $a$
to the smallest "sub-body" that contains $a$, whereas $\sim\neg a$ sends $a$ to the largest
sub-body contained in $a$. Since $\forall a \in P(B)$, $\neg\sim a = \neg\div a$ and $\sim\neg a = \div\neg a$,
we can connect our results to some interesting general properties of bi-Heyting
algebras. Namely, let us define two operators $\Box$ and $\Diamond$ as follows:

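Definition 18. Given a $\sigma$-complete bi-Heyting algebra BH, $\forall a \in BH$,

(i) $\Box_0 = \Diamond_0 = id$; (ii) $\Box_{n+1} = \div\neg\Box_n$; $\Diamond_{n+1} = \neg\div\Diamond_n$;

(iii) $\Box(a) = \bigwedge_{i=1}^{\infty} \Box_i(a)$; (iv) $\Diamond(a) = \bigvee_{i=1}^{\infty} \Diamond_i(a)$.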

Then in [9] it is shown that for any $a$, $\Box(a)$ is the largest complemented element
of BH below $a$, while $\Diamond(a)$ is the smallest complemented element above $a$ (the
interested reader must take great care of the different notations: in the quoted
paper the intuitionistic negation $\div$ is denoted by $\neg$ and the co-intuitionistic
negation $\neg$ is denoted by $\sim$. So in that paper $\Box_1 = \neg\sim$ and $\Diamond_1 = \sim\neg$); it
follows that in Rough Set Systems $\Box = \Box_1$ and $\Diamond = \Diamond_1$. In other words both the
sequences $\Box_1, \Box_2, \ldots$ and $\Diamond_1, \Diamond_2, \ldots$ stabilize at step 1. These facts, as pointed
out in [9], are related to the De Morgan laws that, generally, fail in Heyting
and, respectively, in co-Heyting algebras:

Definition 19. Let H and CH be a Heyting and, respectively, a co-Heyting
algebra. Then: 1) H satisfies the De Morgan law for $\div$ if $\forall x, y$, $\div(x \wedge y) =
\div x \vee \div y$; 2) CH satisfies the De Morgan law for $\neg$ if $\forall x, y$, $\neg(x \vee y) = \neg x \wedge \neg y$.

One can show that in bi-Heyting algebras the law for $\div$ implies $\Box(a) = \div\neg a$
and that the law for $\neg$ implies $\Diamond(a) = \neg\div a$. The converse implications
do not hold, but one can prove that both laws actually hold in Rough Set
Systems:

Proposition 20. In any rough set system $RS(U)$ the two laws of Definition 19
hold.

Proof: In a Heyting algebra H the De Morgan law for $\div$ is equivalent to the fact
that $\div\div(H)$ is a sublattice of H (see [2]). Dually for $\neg$ in co-Heyting algebras.
But this is precisely the case for $RS(U)$: from Proposition 8, $\div\neg a = \neg\neg a$ and
$\neg\div a = \div\div a$, for any $a$, but from Corollary 13 we have immediately that $\neg\div$ and
$\div\neg$ are both multiplicative and additive ($C$ is additive in any topological space
and $\div\div$ is multiplicative in any Heyting algebra. Dually for $\neg\neg$ and $I$; cf. also
[1]). QED

Thus in Rough Set Systems the De Morgan rules for $\div$ and $\neg$ hold. Since this is
also equivalent to the fact that $\forall a$, $\div a$ and $\neg a$ are complemented, from Corollary
13(3) we obtain that for any exact rough set $a$, $\partial a = 0$. In particular, since
$\neg\neg(RS(U)) = rs(AS(U)) = \mathcal{B}(AS(U))$, it follows that Rough Set Systems are
examples of those "lucky" situations in which sub-bodies form a Boolean algebra.
Abstract. An approximation space can be defined as a quintuple $\mathcal{A} =
(T, U, F, \Phi, \Gamma)$, where $F : T \to U$ is a multifunction and $\Phi$ and $\Gamma$ are
unary operations on the power set of $U$.



1 Introduction 

The use of multifunctions for approximations has been studied before under
names such as compatibility relations, binary relations and multi-valued mappings;
see [9, p.63] for further references. Also, in [8], Yao and Lin use various
binary relations for defining approximation operators.

2 Preliminaries 

Denote the class of all subsets of $U$ by $\mathcal{P}(U)$, and the diagonal of $U$ by $\Delta_U =
\{(x, x) : x \in U\}$. If $R \subseteq U \times U$ is a binary relation on $U$, then its inverse is
denoted by $R^{-1} = \{(y, x) : (x, y) \in R\}$. The relation $R$ on $U$ is reflexive if
$\Delta_U \subseteq R$, symmetric if $R = R^{-1}$ and transitive if $R \circ R \subseteq R$.

2.1 Definition.

(1) A binary relation $R$ on $U$ is called:

(a) a tolerance relation if it is reflexive and symmetric;

(b) a preordering (or quasi-ordering, see [7]) of $U$ if it is reflexive and
transitive;

(c) an equivalence relation if it is reflexive, symmetric and transitive.

(2) ([7, p.54]). An operation $F : \mathcal{D} \to \mathcal{E}$ is said to be a unit operation on $\mathcal{D}$ if
$F(D) = \bigcup\{F(\{x\}) : x \in D\}$ for every set $D \in \mathcal{D}$.

(3) An operation $F : \mathcal{P}(U) \to \mathcal{P}(U)$ is

(a) reflexive if it satisfies $x \in F(\{x\})$ for every $x \in U$;

(b) symmetric if it satisfies $x \in F(\{y\}) \Rightarrow y \in F(\{x\})$ for every $x, y \in U$;

(c) transitive if it satisfies (C1) $F(F(\{x\})) \subseteq F(\{x\})$ for every $x \in U$.

(4) An operation $F : \mathcal{P}(U) \to \mathcal{P}(U)$ is a tolerance (respectively, a preordering;
an equivalence) on $\mathcal{P}(U)$ if it is reflexive and symmetric (respectively, reflexive
and transitive; reflexive, symmetric and transitive) on $\mathcal{P}(U)$.

2.2 Lemma. If $F : \mathcal{P}(U) \to \mathcal{P}(U)$ is a unit operation, then the conditions
(C1), (C2) and (C3) are equivalent, where

(C2) $y \in F(\{x\}) \Rightarrow F(\{y\}) \subseteq F(\{x\})$ for every $x, y \in U$;

(C3) $(z \in F(\{y\})$ and $y \in F(\{x\})) \Rightarrow z \in F(\{x\})$ for every $x, y, z \in U$.

L. Polkowski and A. Skowron (Eds.): RSCTC’98, LNAI 1424, pp. 131-138, 1998. 

(c) Springer-Verlag Berlin Heidelberg 1998 
3 The approach that uses relations

Pawlak [5] defines an approximation space to be an ordered pair $\mathcal{A} = (U, R)$,
where $R \subseteq U \times U$ is an indiscernibility relation on $U$.

3.1 Tolerance relations

If $R$ is a tolerance relation on $U$, if $x \in U$ and $|x|_R$ is the tolerance class
determined by $x$, then the upper approximation operation $\overline{R} : \mathcal{P}(U) \to \mathcal{P}(U)$ and the
lower approximation operation $\underline{R} : \mathcal{P}(U) \to \mathcal{P}(U)$ with respect to $R$ are defined,
for every set $A \in \mathcal{P}(U)$, by

$\overline{R}(A) = \{x \in U : |x|_R \cap A \neq \emptyset\}$ and $\underline{R}(A) = \{x \in U : |x|_R \subseteq A\}$ respectively.

3.1.1 Example. Define the tolerance relation $R$ on $\mathbb{R}$ by $xRy \Leftrightarrow (x, y \in (t -
1, t + 1)$ for some $t \in \mathbb{R})$. Then $\overline{R}([0, 1]) = \{x \in \mathbb{R} : |x|_R \cap [0, 1] \neq \emptyset\} = (-2, 3)$
and $\underline{R}([0, 1]) = \emptyset$. Also, $\overline{R}([0, 4]) = (-2, 6)$ and $\underline{R}([0, 4]) = \{2\}$.
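The two operations of Section 3.1 are easy to compute on a finite universe. The sketch below is a discrete analogue of Example 3.1.1, not the example itself: it assumes a hypothetical universe of integers and the tolerance relation "differs by at most 1", which is reflexive and symmetric but not transitive.

```python
# Hypothetical finite universe standing in for the real line.
U = set(range(-3, 8))


def related(x, y):
    """Tolerance relation: reflexive and symmetric, not transitive."""
    return abs(x - y) <= 1


def tol_class(x):
    """The tolerance class determined by x."""
    return {y for y in U if related(x, y)}


def upper(A):
    """Upper approximation: points whose class meets A."""
    return {x for x in U if tol_class(x) & A}


def lower(A):
    """Lower approximation: points whose class lies inside A."""
    return {x for x in U if tol_class(x) <= A}


A = {0, 1, 2, 3, 4}
```

Here `upper(A)` adds one point on each side of `A` and `lower(A)` removes one point on each side, mirroring how the interval approximations in the example grow and shrink.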

3.2 Equivalence relations 

If $R$ is an equivalence relation on $U$, if $x \in U$ and $[x]_R$ is the equivalence class
determined by $x$, then $\overline{R}(A) = \{x \in U : [x]_R \cap A \neq \emptyset\}$ defines the upper
approximation and $\underline{R}(A) = \{x \in U : [x]_R \subseteq A\}$ the lower approximation operation
respectively. It is clear that the operations $\overline{R}$ and $\underline{R}$ are equivalence on $\mathcal{P}(U)$ by
definition 2.1(4).

3.2.1 Examples.

(1) Define the equivalence relation $R$ on $\mathbb{R}$ by
$xRy \Leftrightarrow (x, y \in [\lfloor n \rfloor, \lfloor n \rfloor + 1)$ for some $n \in \mathbb{R})$, where $\lfloor x \rfloor$ is the largest integer
less than or equal to $x$. If $x \in \mathbb{R}$, then $[x]_R = [\lfloor x \rfloor, \lfloor x \rfloor + 1)$. $\overline{R}([0, 1]) = \{x \in
\mathbb{R} : [x]_R \cap [0, 1] \neq \emptyset\} = [0, 2)$ and $\underline{R}([0, 1]) = \{x \in \mathbb{R} : [x]_R \subseteq [0, 1]\} = [0, 1)$.

(2) Let $a < b$ and $U = (a, b)$. Let $xRy \Leftrightarrow x - y$ is a rational number. If $A = (c, d)$,
where $a < c < d < b$, then $\underline{R}((c, d)) = \emptyset$ and $\overline{R}((c, d)) = U$. If $x \in A$ is a
rational number, then $\overline{R}([x]_R) = [x]_R = \underline{R}([x]_R)$.

A Pawlak approximation space A = (L, it) induces a triple (L, it, R) or (L, it, A), 
depending on whether it is a tolerance or and equivalence relation on U . 

3.3 Definition. Let it be an indiscernibility relation on U and let A= ([/, it). 

(1) If it is a tolerance relation, then (L, it, R) is called the generalized, approxi- 
mation space of Pawlak induced by A. 

(2) If R is an equivalence relation, then (L, it, A) is called the approximation 
space of Pawlak induced by A. 
4 The approach that uses covers

The approach to approximation spaces through the means of a cover for the
universe $U$ is due to Zakowski [10]. For a binary relation $R$ on $U$, let the operation
$R : \mathcal{P}(U) \to \mathcal{P}(U)$ be defined, for every set $A \in \mathcal{P}(U)$, by

($\alpha$) $R(A) = \{y \in U : \exists x \in A$ such that $xRy\}$.

If $R$ is reflexive, then $R$ is a reflexive operation. Consequently, every reflexive
relation $R$ on $U$ uniquely determines a cover for $U$:

($\beta$) $\mathcal{C}_R = \{R(\{x\}) : x \in U\}$.

If $R$ is a tolerance (an equivalence) relation on $U$, then $\mathcal{C}_R$ is the collection
of all tolerance (equivalence) classes of $R$, determined by the elements of $U$.
If $R$ is an equivalence relation on $U$, then $\mathcal{C}_R$ is a partition of $U$. In fact, it is
common knowledge that if $R$ is an equivalence relation on $U$, then the collection
of equivalence classes is a partition $\mathcal{C}_R$ of $U$ that induces the relation $R$, and if $\mathcal{C}$
is a partition of $U$, then the induced relation $R_{\mathcal{C}}$ on $U$ is an equivalence relation
whose collection of equivalence classes is exactly $\mathcal{C}$; see for example Halmos [2,
p.28]. Every cover (partition) $\mathcal{C}$ of $U$ induces a tolerance (an equivalence) relation
$R_{\mathcal{C}}$ on $U$:

($\gamma$) $x R_{\mathcal{C}} y \Leftrightarrow (x, y \in U$ and $\exists C \in \mathcal{C}$ $(x, y \in C))$.

Note further, from the definition of $R_{\mathcal{C}}$ in ($\gamma$) above, that

($\delta$) $R_{\mathcal{C}}(\{x\}) = \bigcup\{C \in \mathcal{C} : x \in C\} = |x|_{R_{\mathcal{C}}}$ for every $x \in U$.

Consequently, the sets of $\mathcal{C}$ need not be the tolerance classes of $R_{\mathcal{C}}$, whereas,
from ($\beta$),

($\varepsilon$) $\mathcal{C}_{R_{\mathcal{C}}} = \{R_{\mathcal{C}}(\{x\}) : x \in U\}$ is the collection of tolerance classes of $R_{\mathcal{C}}$.

If $\mathcal{C}$ is a partition then the sets of $\mathcal{C}$ are the equivalence classes of $R_{\mathcal{C}}$.

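4.1 Definition. If $U$ is a nonempty set and $\mathcal{C}$ a cover for $U$, then the ordered
pair $(U, \mathcal{C})$ is called an approximation space of Zakowski [10].

Let an approximation space $\mathcal{A} = (U, \mathcal{C})$ be given as in definition 4.1. Following
Zakowski, we define two approximation operations on $\mathcal{P}(U)$. For every set $A \in
\mathcal{P}(U)$, the upper approximation operation $\overline{\mathcal{C}} : \mathcal{P}(U) \to \mathcal{P}(U)$ and the lower
approximation operation $\underline{\mathcal{C}} : \mathcal{P}(U) \to \mathcal{P}(U)$ are defined by

($\zeta$) $\overline{\mathcal{C}}(A) = \bigcup\{C \in \mathcal{C} : C \cap A \neq \emptyset\}$ and ($\eta$) $\underline{\mathcal{C}}(A) = \bigcup\{C \in \mathcal{C} : C \subseteq A\}$
respectively.

We also define a weak lower approximation operation $\underline{\mathcal{C}}' : \mathcal{P}(U) \to \mathcal{P}(U)$ [6,
p.656]:

($\theta$) $\underline{\mathcal{C}}'(A) = \{y \in U : \overline{\mathcal{C}}(\{y\}) \subseteq A\}$ for every $A \in \mathcal{P}(U)$.

In general, $\underline{\mathcal{C}}'(A) \subseteq \underline{\mathcal{C}}(A)$ for every set $A \in \mathcal{P}(U)$, and $\underline{\mathcal{C}}' = \underline{\mathcal{C}}$ if $\mathcal{C}$ is a partition
of $U$.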
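The three cover-based operations can be compared on a small finite cover. The following sketch assumes a hypothetical five-element universe with overlapping cover sets; `upper`, `lower` and `weak_lower` are illustrative names for the upper, lower and weak lower approximation operations defined above.

```python
# Hypothetical universe, cover and partition.
U = {1, 2, 3, 4, 5}
COVER = [{1, 2}, {2, 3}, {4, 5}]
PARTITION = [{1, 2}, {3}, {4, 5}]


def upper(A, cover=COVER):
    """Union of the cover sets that meet A."""
    return set().union(*[C for C in cover if C & A])


def lower(A, cover=COVER):
    """Union of the cover sets contained in A."""
    return set().union(*[C for C in cover if C <= A])


def weak_lower(A, cover=COVER, universe=U):
    """Points y whose upper approximation of {y} lies inside A."""
    return {y for y in universe if upper({y}, cover) <= A}


A = {1, 2, 4}
```

For this cover the weak lower approximation of `A` is strictly smaller than the lower approximation, because the point 2 belongs to two overlapping cover sets; with `PARTITION` in place of `COVER` the two operations coincide on the sets tried here.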

An approximation space of Zakowski can induce at least two triples, $(U, \overline{\mathcal{C}}, \underline{\mathcal{C}})$
or $(U, \overline{\mathcal{C}}, \underline{\mathcal{C}}')$.

If we restrict both concepts of approximation spaces to the spaces determined 
by equivalence relations and partitions respectively, then both approaches to 
them are equivalent. The notion of approximation spaces determined by a cover 
(the approach by Zakowski) is in some sense a generalization of the classical one 
defined by Pawlak, as given in definition 3.3(2). 
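5 The equivalence of the two approaches

5.1 Question. Under what conditions are the two presented approaches to
approximation spaces equivalent? Wybraniec-Skardowska [7] established, among
others, the following.

5.2 Results

(1) If $\mathcal{C}$ is a cover for $U$, then $(U, \overline{\mathcal{C}}, \underline{\mathcal{C}}') = (U, \overline{R}_{\mathcal{C}}, \underline{R}_{\mathcal{C}})$.

(2) If $\mathcal{C}$ is a partition of $U$, then $(U, \overline{\mathcal{C}}, \underline{\mathcal{C}}) = (U, \overline{\mathcal{C}}, \underline{\mathcal{C}}') = (U, \overline{R}_{\mathcal{C}}, \underline{R}_{\mathcal{C}})$.

(3) If $R$ is a tolerance relation on $U$, then, for any set $A \subseteq U$,

(a) $\overline{R}(A) = \overline{\mathcal{C}}_R(A)$

(b) $\underline{R}(A) = \underline{\mathcal{C}}'_R(A)$.

(4) If $R$ is an equivalence relation on $U$, then for any set $A \subseteq U$,

(a) $\overline{R}(A) = \overline{\mathcal{C}}_R(A)$

(b) $\underline{R}(A) = \underline{\mathcal{C}}_R(A)$,

and then $(U, \overline{R}, \underline{R}) = (U, \overline{\mathcal{C}}_R, \underline{\mathcal{C}}_R)$ on $U$.

Results 5.2 (1), (2) and (4) yield the statement that if we restrict both concepts
of approximation spaces to the spaces determined by partitions and equivalence
relations respectively, then both approaches to them are equivalent.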

6 The approach by multifunctions 

Henceforth, we assume that $T$ and $U$ are nonempty sets, with $U$ not necessarily
finite, and that $F : T \to U$ is a strict multifunction. If $A \subseteq U$, then $F^+(A) =
\{t \in T : F(t) \subseteq A\}$ and $F^-(A) = \{t \in T : F(t) \cap A \neq \emptyset\}$, where $F^+$ is the strong
and $F^-$ the weak inverse of $F$ respectively. Furthermore,

($\iota$) $F(B) = \bigcup\{F(t) : t \in B\}$ for every set $B \subseteq T$.

If $F : T \to U$ is a multifunction, then $F$ can also be regarded as a unary
operation $F : \mathcal{P}(T) \to \mathcal{P}(U)$, and ($\iota$) now becomes, for every set $B \in \mathcal{P}(T)$,

($\kappa$) $F(B) = \bigcup\{F(\{t\}) : t \in B\}$, so that $F$ is a unit operation on $\mathcal{P}(T)$.

The set $F(T)$ is called the range of $F$. If $F(T) = U$, then $F$ is a surjection.
We assume throughout this section that $F$ is a surjection. The class $\mathcal{F}(T) =
\{F(t) : t \in T\}$ is then a cover for $U$. Let $R$ be a tolerance relation on $U$. By
section 4, $R$ uniquely determines a cover $\mathcal{C}_R$ for $U$, and $\mathcal{C}_R$ is the collection of
all tolerance classes of $R$, determined by the elements of $U$. Write

($\lambda$) $\mathcal{C}_R = \{E_t : t \in T\}$

where $xRy$ if and only if $x, y \in E_t$ for some $t \in T$, where $T$ is the index set for
the collection of tolerance classes of $R$. This clearly defines a surjective
multifunction $F_R : T \to U$ by $F_R(t) = E_t$ for every $t \in T$. Consequently, a tolerance
relation $R$ on $U$ uniquely determines a surjective multifunction $F_R : T \to U$,
and hence a cover $\mathcal{C}_R = \mathcal{F}_R(T) = \{F_R(t) : t \in T\}$ for $U$. Similarly, if $R$ is
an equivalence relation on $U$, then it uniquely determines a partition $\mathcal{C}_R$ of $U$,
where the elements of $\mathcal{C}_R$ are exactly the equivalence classes of $R$: the class
$\mathcal{C}_R = \mathcal{F}_R(T) = \{F_R(t) : t \in T\}$ is then a partition of $U$ by equivalence classes
of $R$. Let $F : T \to U$ be a multifunction. By definition 4.1, the ordered pair
$\mathcal{A} = (U, \mathcal{F}(T))$ is an approximation space of Zakowski. The cover $\mathcal{F}(T)$ for $U$
induces a tolerance relation $R_{\mathcal{F}(T)}$ on $U$, so that by ($\gamma$) in section 4,

($\mu$) $x R_{\mathcal{F}(T)} y \Leftrightarrow (x, y \in U$ and $\exists t \in T$ such that $x, y \in F(t))$.

In the notation of [3, Definition 7.2.1], let

($\nu$) $G(x) = \bigcup\{F(t) \in \mathcal{F}(T) : x \in F(t)\}$ for every $x \in U$.

The set $G(x)$ is said to be an indiscernibility neighborhood of $x$. By ($\delta$),

($\xi$) $R_{\mathcal{F}(T)}(\{x\}) = \{y \in U : x R_{\mathcal{F}(T)} y\} = \bigcup\{F(t) \in \mathcal{F}(T) : x \in F(t)\} =
|x|_{R_{\mathcal{F}(T)}} = G(x)$ for every $x \in U$, and for every set $A \subseteq U$, ($\alpha$) in section
4 becomes

($o$) $R_{\mathcal{F}(T)}(A) = \{y \in U : \exists x \in A$ such that $x R_{\mathcal{F}(T)} y\} = \bigcup\{R_{\mathcal{F}(T)}(\{x\}) :
x \in A\} = \bigcup\{G(x) : x \in A\} = G(A)$ by ($\kappa$).

Henceforth, we shall use $G$ instead of $R_{\mathcal{F}(T)}$. $G$ can be regarded as a multifunction,
$G : U \to U$, or as an operation $G : \mathcal{P}(U) \to \mathcal{P}(U)$, where $G(x) = G(\{x\})$
for every $x \in U$. The operation $G : \mathcal{P}(U) \to \mathcal{P}(U)$ is clearly tolerance and, if
$x \in F(t)$, then $F(t) \subseteq G(x)$. The sets in $\mathcal{F}(T)$ are not necessarily the tolerance
classes of $R_{\mathcal{F}(T)}$. By ($\varepsilon$) in section 4, the collection

($\pi$) $\mathcal{C}_{R_{\mathcal{F}(T)}} = \{G(x) : x \in U\}$ is the collection of tolerance classes of $R_{\mathcal{F}(T)}$.

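For the approximation space $\mathcal{A} = (U, \mathcal{F}(T))$, we define, in the notation of
[3], the following four operations on $\mathcal{P}(U)$. For $A \subseteq U$, let

($\rho$) $FF^-(A) = F(F^-(A)) = \bigcup\{F(t) \in \mathcal{F}(T) : F(t) \cap A \neq \emptyset\}$,
the upper approximation of $A$ in $\mathcal{A}$ (corresponding to ($\zeta$) in section 4);

($\sigma$) $FF^+(A) = F(F^+(A)) = \bigcup\{F(t) \in \mathcal{F}(T) : F(t) \subseteq A\}$,
the lower approximation of $A$ in $\mathcal{A}$ (corresponding to ($\eta$) in section 4);

($\tau$) $G_1(A) = \{x \in A : \forall F(t) \in \mathcal{F}(T)\ (x \in F(t) \Rightarrow F(t) \subseteq A)\}$,
the weak lower approximation of $A$ in $\mathcal{A}$ (it will follow from ($\psi$) below that this
corresponds to ($\theta$) in section 4);

($\upsilon$) $G_2(A) = U \setminus FF^+(U \setminus A)$,
the strong upper approximation of $A$ in $\mathcal{A}$. For every set $A \subseteq U$ and every $x \in U$,

($\varphi$) $FF^-(A) = \{x \in U : \exists t \in T$ such that $x \in F(t)$ and $F(t) \cap A \neq \emptyset\}$

($\chi$) $FF^-(\{x\}) = \bigcup\{F(t) \in \mathcal{F}(T) : x \in F(t)\} = G(x) = |x|_{R_{\mathcal{F}(T)}}$

($\psi$) $G_1(A) = \{x \in A : FF^-(\{x\}) \subseteq A\} = G^+(A)$

($\omega$) $G_2(A) = \{x \in U : \forall F(t) \in \mathcal{F}(T)\ (x \in F(t) \Rightarrow F(t) \cap A \neq \emptyset)\}$

Also $G_1(A) \subseteq FF^+(A)$, $G_2(A) \subseteq FF^-(A)$. Thus, if $F : T \to U$ is a (surjective)
multifunction, then $\mathcal{F}(T)$ is a cover for $U$, and the approximation space
$\mathcal{A} = (U, \mathcal{F}(T))$ induces the triples $(U, FF^-, FF^+)$, $(U, FF^-, G_1)$, $(U, G_2, FF^+)$
and $(U, G_2, G_1)$.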
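The four operations $FF^-$, $FF^+$, $G_1$ and $G_2$ can be computed directly when $T$ and $U$ are finite. The sketch below assumes a hypothetical multifunction on a range of integers, $F(t) = \{t-1, t, t+1\}$ cut down to $U$, as a discrete stand-in for interval-valued maps; all names are illustrative.

```python
# Hypothetical finite universe and index set.
U = set(range(-6, 7))
T = set(range(-6, 7))


def F(t):
    """A strict, surjective multifunction T -> U (illustrative choice)."""
    return {t - 1, t, t + 1} & U


def ff_minus(A):
    """Upper approximation: union of the values F(t) that meet A."""
    return set().union(*[F(t) for t in T if F(t) & A])


def ff_plus(A):
    """Lower approximation: union of the values F(t) contained in A."""
    return set().union(*[F(t) for t in T if F(t) <= A])


def g1(A):
    """Weak lower approximation: every F(t) through x stays inside A."""
    return {x for x in A if all(F(t) <= A for t in T if x in F(t))}


def g2(A):
    """Strong upper approximation: complement of ff_plus of the complement."""
    return U - ff_plus(U - A)


A = set(range(-3, 4))
```

With this choice the four results are nested as $G_1(A) \subseteq FF^+(A) = A = G_2(A) \subseteq FF^-(A)$, and `g1` agrees with the alternative description of $G_1$ through $FF^-(\{x\})$.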



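6.1 Example. Let $F : \mathbb{R} \to \mathbb{R}$ be defined by $F(t) = (t - 1, t + 1)$ for every
$t \in \mathbb{R}$. Then $\mathcal{A} = (\mathbb{R}, \mathcal{F}(\mathbb{R}))$ is an approximation space in the sense of Zakowski,
where $\mathcal{F}(\mathbb{R}) = \{(t - 1, t + 1) : t \in \mathbb{R}\}$ is a cover for $\mathbb{R}$. Define the tolerance
relation $R_{\mathcal{F}(\mathbb{R})}$ on $\mathbb{R}$ by $x R_{\mathcal{F}(\mathbb{R})} y \Leftrightarrow (x, y \in (t - 1, t + 1)$ for some $t \in \mathbb{R})$. If
$x \in \mathbb{R}$, then, by ($\xi$), $R_{\mathcal{F}(\mathbb{R})}(\{x\}) = |x|_{R_{\mathcal{F}(\mathbb{R})}} = G(x) = (x - 2, x + 2)$ so that, for
example, $R_{\mathcal{F}(\mathbb{R})}(\{3\}) = (1, 5)$, $R_{\mathcal{F}(\mathbb{R})}(\{4\}) = (2, 6)$ and $R_{\mathcal{F}(\mathbb{R})}(\{1, 4, 6\}) = (-1, 8)$.
Clearly, the sets in $\mathcal{F}(\mathbb{R})$, namely the intervals $(t - 1, t + 1)$, are not the tolerance
classes of $R_{\mathcal{F}(\mathbb{R})}$. By ($\pi$) and example 3.1.1, $\mathcal{C}_{R_{\mathcal{F}(\mathbb{R})}} = \{(t - 2, t + 2) : t \in \mathbb{R}\}$ is
the collection of tolerance classes of $R_{\mathcal{F}(\mathbb{R})}$. Let $A = [-6, 8]$. Then, by ($\rho$), ($\sigma$), ($\tau$)
and ($\upsilon$), we have that $FF^-(A) = (-8, 10)$, $FF^+(A) = (-6, 8)$, $G_1(A) = [-4, 6]$
and $G_2(A) = [-6, 8]$. It is clear that the approximation spaces $(U, FF^-, FF^+)$,
$(U, FF^-, G_1)$, $(U, G_2, FF^+)$ and $(U, G_2, G_1)$ are mutually distinct.

By ($o$) and ($\varphi$), $R_{\mathcal{F}(T)}([0, 1]) = \bigcup\{R_{\mathcal{F}(T)}(\{x\}) : x \in [0, 1]\} = G([0, 1]) =
\bigcup\{G(x) : x \in [0, 1]\} = FF^-([0, 1]) = (-2, 3)$. If $\mathcal{F}(T)$ is a partition of $U$,
then $R_{\mathcal{F}(T)}$ is an equivalence relation on $U$, $G(x) = [x]_{R_{\mathcal{F}(T)}} = F(t)$ if $x \in
F(t)$, $FF^+ = G_1$ on $\mathcal{P}(U)$, and $FF^- = G_2$ on $\mathcal{P}(U)$ because $FF^-(A) \cap G_1(U \setminus A) = \emptyset$.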

6.2 Lemma. ([4, p.167]). A multifunction $F : T \to U$ is semi-single-valued if
and only if there exists an equivalence relation $R$ on $U$ and a function $f : T \to U$
such that $F = R \circ f$.

If $F : T \to U$ is semi-single-valued, then $\mathcal{F}(T)$ is a partition of $U$.



6.3 Lemma. 

(1) Ifi'-T ^ 1/ is a multifunction, then for A C U and x E U: 

(a) FF~{A) = U{F F~ {{x}) : x E A} 

(b) i^jr(T)(^) = = Li{i^jT(T)({^}) • ^ ^ = U{G(x) : x E A} = 

G{A) 

(c) FF ({x}) = G(x) = 

(d) Rt{t){A) = U{G'(x) : X e A} = U{FF-{{x}) : x e A} = FF-{A) 

(e) Rr(T){A) = {y eU : n A 7 ^ 0} = Rjr{T){A). 

(2) If is semi-single- valued, then for A dU and x E U: 

(f) H;t(t)({x}) = = G{x) = F{t) if xe F{t) 

(g) i?jr(T)(A) = {y eU : [y]i?^(T) n A / 0} = i?jr(T)(A). 

Lemmas 6.4 and 6.5 below stipulate correspondences between some of the 
ordered triples of definition 3.3 and those introduced in this section. 

6.4 Lemma. 

(1) If A : 2’ ^ L is a multifunction, then (L, t?i) = (L, ^ )• 

J-(T) 

( 2 ) Ifi-AT ^ L is a semi-single- valued multifunction, then 
{U,FF-,FF+) = (L/,TT-,Gi) = ([7,^2, TT+) = (t/,G2,Gi) = 

( L , Rr(t),Rr(t))' 



6.5 Lemma. 

(1) If $R$ is a tolerance relation on $U$, $F_R : T \to U$ the corresponding multifunction
and $A$ any subset of $U$, then:

(a) R{A) = FrF^{A) 

(b) R{A) = FrF+{A). 
(2) If $R$ is an equivalence relation on $U$, $F_R : T \to U$ the corresponding
multifunction and $A$ any subset of $U$, then:

(c) R{A)=FrF^{A) 

(d) R{A) = FrF+{A) and {U,R,R) = {U , FrFj^, FrF+). 

7 An approximation space 

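Wybraniec-Skardowska [7, p.54] defines an approximation space by $\mathcal{A} = (U, \Phi, \Gamma)$
where $U$ is the universe of $\mathcal{A}$ and $\Phi : \mathcal{P}(U) \to \mathcal{P}(U)$ and $\Gamma : \mathcal{P}(U) \to \mathcal{P}(U)$ are such
that, for $A \subseteq U$,

(a1) $\Phi(A) \subseteq U$ and $\Gamma(A) \subseteq A$
(a2) $\Phi(A) = \bigcup\{\Phi(\{x\}) : x \in A\}$

(a3) $\Gamma(A) = \{x \in U : \emptyset \neq \Phi(\{x\}) \subseteq A\}$.

The operations $\Phi$ and $\Gamma$ are respectively called the upper and lower
approximation operations associated with $\mathcal{A}$. We take this one step further.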

7.1 Definition. An approximation space is an ordered quintuple
$\mathcal{A} = (T, U, F, \Phi, \Gamma)$, where $T$ and $U$ (the universe) are nonempty sets, $F : T \to U$
is a multifunction, and $\Phi : \mathcal{P}(U) \to \mathcal{P}(U)$ and $\Gamma : \mathcal{P}(U) \to \mathcal{P}(U)$ are unary operations
satisfying the properties (a1)-(a3).

We now employ the work on multifunctions from section 6. Note that the
operation $FF^- : \mathcal{P}(U) \to \mathcal{P}(U)$ satisfies $FF^-(A) = \bigcup\{FF^-(\{x\}) : x \in A\}$ for
every $A \in \mathcal{P}(U)$, so that $FF^-$ is a unit operator on $\mathcal{P}(U)$. If we let $T = U$ and
$F = \Phi$ on the class $S$ of singleton subsets of $U$, then because $F(A) = \bigcup\{F(t) :
t \in A\} = \bigcup\{\Phi(\{t\}) : t \in A\} = \Phi(A)$, and $F^+(A) = \{x \in U : F(x) \subseteq A\} =
\{x \in U : \emptyset \neq F(x) \subseteq A\} = \{x \in U : \emptyset \neq \Phi(\{x\}) \subseteq A\} = \Gamma(A)$ ($F$ is strict by
assumption), we can reduce the quintuple $\mathcal{A} = (T, U, F, \Phi, \Gamma)$ in definition 7.1 to
the triple $\mathcal{A} = (U, \Phi, \Gamma) = (U, F, F^+)$ in the definition of Wybraniec-Skardowska.

7.2 Remarks.

( 1 ) IIF -.T ^ is a multifunction, then the cover F(7') induces the tolerance 

relation Rj^(t) on U . By lemma 6.4, FF~[A) = Fjt(t)(A) for every A G 
V{U) and Gi[A) = R (A) where, by (^) in section 6, Gi[A) = {x G A : 

FF-({x}) c A} and A = (F, [/, F, FF“, Gi) = {T,U, F,Rj.^t). R ) is 

an approximation space determined by with Rj=-(t) and R the upper 

and lower approximations, respectively. Example 3.1.1 illustrates this. 

(2) If the multifunction F : T ^ U is semi-single- valued, then the class F{T) is 
a partition of U and Fjt(t) is the induced equivalence relation. Then, lemma 
6.4, and [3, Corollary 7. 4. 3(2)], 

.4 = {T, U, F, FF-, FF+) = {T, U, F, FF-,Gi) = (T, U, F, G 2 , FF+) = 
(3) If $R$ is a tolerance relation on $U$, then $R$ uniquely determines a cover $\mathcal{C}_R$
(section 4) which in turn defines a surjective multifunction $F_R : T \to U$ (see
($\lambda$) in section 6), where by ($\beta$) and ($\delta$) in section 4, $\mathcal{C}_R = \{R(\{x\}) : x \in
U\}$, and $R(\{x\}) = |x|_R$ is the $R$-tolerance class determined by $x$. Define

R : V{U) V{U) by R{A) = {y e U : $ ^ ^(M) ^ A}- Then R{A) = 
R (A) and we have the approximation space A = (i( R, Fr^ FF~ ^ Gi) = 

-Tr{T) 

(i( u, Fr, r, r) = (i( u, Fr, r,r ) = (i; u, Fr, X g+), 

by lemma 6.3(d) and (^) in section 6. If it is an equivalence relation, then 
R — R and R — R. Then the class Fr{T) = {FR{t) : t G T} is a partition 

of [/, and as noticed just above Lemma 6.2, FF^ = Gi and FF~ = G 2 . In 
this case we also have that A = (T, [/, Fr^ G 2 , FF^) = (T, [/, Fr^ it, R), 

(4) If C is a cover for [/, then by 5.2(1), (ti,C,C^) = (t/,itc,it ) and again we 

can arrive at Al = (i( L, T, itc, it ) = (i( t/, T, C, C^) where F : T ^ U is 

yC 

defined as indicated by (A) in section 6. If C is a partition of [/, then by 5.2(2), 
(t/, C,C) = (L,C,C^) = (L, RciHc) have that A = (i( L, T, RciHc) ^ 

(i;L/,T,C,Q. 

( 5 ) For other properties of the operations FF~ : F{U) T^{^) FF^ : 

V{U) V{U) and for the topologies associated by FF~ and see [3]. 
1 Introduction

Information systems (in the sense of Pawlak) are used for representing properties of 
objects by the means of attributes and their values. However, sometimes there are situ- 
ations in which we cannot give an exact value of an attribute for an object; we can only 
approximate the incompletely known value by a subset of values, in which the actual 
value is expected to be. 

In this kind of information system we can define several information relations
on the object set. It seems that these relations have many common properties. Here we
introduce strong and weak preimage relations, which are suitable for the investigation
of those common features.

Dependence spaces are general settings for the study of, for example, reducts of
attribute sets. Here we consider in particular dense families of dependence spaces, and we
give a characterization of reducts by the means of dense families. Dependence spaces
defined by strong and weak preimage relations are also studied. In addition to this, 
we introduce matrices of preimage relations and show how we can determine dense 
families of dependence spaces defined by strong and weak preimage relations by using 
these matrices. 

This paper is structured as follows. In the next section we define information sys- 
tems and some information relations. Section 3 contains the definition of strong and 
weak preimage relations. In Section 4 we give some basic concepts concerning de- 
pendence spaces and show how strong and weak preimage relations define dependence 
spaces. Finally, in Section 5 dense families of dependence spaces and matrices of preim- 
age relations are studied. 



2 Information Systems and Information Relations 

An information system is a triple $S = (U, A, \{V_a\}_{a \in A})$, where $U$ is a set of objects, $A$
is a set of attributes, and $\{V_a\}_{a \in A}$ is an indexed set of value sets of attributes. All these
sets are assumed to be finite and nonempty. Each attribute is a function $a : U \to \wp(V_a)$
which assigns subsets of values to objects such that $a(x) \neq \emptyset$ for all $a \in A$ and $x \in U$
(see e.g. [5,6]).

If $|a(x)| = 1$, then the information of the attribute $a$ for the object $x$ is complete (or
deterministic), and we usually write $a(x) = \{v\}$ simply as $a(x) = v$. If $|a(x)| > 1$,
then the information of the attribute $a$ for the object $x$ is incomplete (or
nondeterministic). For example, if $a$ is "age", and $x$ is 25 years old, then $a(x) = 25$. It is also possible
that we know the age of $x$ only approximately, say between 20 and 28. In this case
$a(x) = \{20, \ldots, 28\}$.

In the following we shall present 16 information relations between objects, which 
are based on their values of attributes. These relations can be found in [5,6], for ex- 
ample. Suppose S = {U,A, {Va}aeA) is an information system and let B{C A) be 
a subset of attributes. The first eight relations reflect indistinguishability between ob- 
jects: 

• strong (weak) indiscernibility: 

(x, y) G ind[B) {wind[B)) iff a{x) = a[y) for all (some) a ^ B; 

• strong (weak) similarity: 

(x, y) G sim[B) [wsim[B)) iff a[x) fl a[y) ^ 0 for all (some) a ^ B; 

• strong (weak) forward inclusion: 

{^^y) ^ (^/^^(^)) Afa{x) C a[y) for all (some) a £ B; 

• strong (weak) backward inclusion: 

{x^y) G bin[B) [whin[B)) iff a[y) C a[x) for all (some) a ^ B, 
where —a{x) denotes the complement of a{x) in 14 . 

For example, two objects are strongly i^-indiscernible if they have the same values 
for all attributes in B, and they are weakly S-similar if there is an attribute in B such 
that these objects have at least one common value for this attribute. 

The following eight relations reflect distinguishability between objects: 

• strong (weak) diversity: 

(x^y) G div[B) {wdiv[B)) \ffa{x) f a[y) for all (some) a e B; 

• strong (weak) right orthogonality: 

{x^y) G rort{B) {wrort{B)) iff a{x) C —a{y) for all (some) a £ B; 

• strong (weak) right negative similarity: 

(x, y) G rnim[B) {wrnim[B)) iff a[x) fl —a[y) f 0 for all (some) a £ B; 

• strong (weak) left negative similarity: 

{x^y) G lnim[B) {wlnim[B)) iff —a{x) fl a[y) f 0 for all (some) a E B. 

For example, two objects are weakly $B$-diverse if their values differ for at least one
attribute in $B$, and two objects are strongly right $B$-orthogonal if they have no common
value for any attribute in $B$.

3 Preimage Relations 

All information relations defined at the end of the previous section are similar in the
following sense. Two objects belong to a certain strong (resp. weak) information relation
with respect to an attribute set $B$ if and only if all (resp. some) of their value sets of
$B$-attributes are in a specified relation. For example, objects $x$ and $y$ are in the relation
$sim(B)$ if and only if $a(x) \cap a(y) \neq \emptyset$ for all attributes $a$ in $B$.

In this section we introduce preimage relations. This notion allows us to study properties
of information relations in a more abstract setting. We denote by $Rel(A)$ the set
of all binary relations on a set $A$. The complement of any relation $R$ $(\in Rel(A))$ is
$-R = \{(x, y) \in A^2 \mid (x, y) \notin R\}$. The set of all maps from $A$ to $B$ is denoted by $B^A$.
Moreover, we assume that $U$ and $Y$ are nonempty sets, and $R \in Rel(Y)$.
For any map $f \colon U \to Y$, the preimage relation of $R$ is

$f^{-1}(R) = \{(x, y) \in U^2 \mid f(x) R f(y)\}$.

Thus, two elements $x$ and $y$ are in the relation $f^{-1}(R)$ if and only if their images $f(x)$
and $f(y)$ are in the relation $R$. For example, if $R$ is the equality relation, $=$, then $f^{-1}(R)$
is the kernel of the map $f$, $\ker f = \{(x, y) \mid f(x) = f(y)\}$.

The following obvious lemma shows that $f^{-1}(R)$ inherits many properties from $R$.

Lemma 1. If $R$ is reflexive, then $f^{-1}(R)$ is reflexive, and similar conditions hold
when $R$ is irreflexive, symmetric, or transitive. □

It is also true that

$f^{-1}(-R) = -f^{-1}(R)$.

Example 1. Suppose $S = (U, A, \{V_a\}_{a \in A})$ is an information system, which is
described in the following table.

      Age (years)      Weight (kg)      Height (cm)
P1    {22, ..., 26}    {48, ..., 54}    {154, ..., 157}
P2    {26, ..., 33}    {73, ..., 78}    {170, ..., 175}
P3    {24, ..., 29}    {51, ..., 58}    {159, ..., 162}
P4    {31, ..., 37}    {75, ..., 82}    {157, ..., 165}

We denote $V = \bigcup_{a \in A} V_a$ and $Y = \wp(V) - \{\emptyset\}$. Let us define the binary relation $SIM$
on $Y$ by setting

$(W_1, W_2) \in SIM \iff W_1 \cap W_2 \neq \emptyset$.

The preimage relations of $SIM$ with respect to the attributes Age, Weight, and Height
are the following.

$Age^{-1}(SIM) = \{(x, y) \in U^2 \mid Age(x) \cap Age(y) \neq \emptyset\}$
    $= \Delta \cup \{(P1,P2), (P2,P1), (P1,P3), (P3,P1), (P2,P3), (P3,P2), (P2,P4), (P4,P2)\}$
$Weight^{-1}(SIM) = \{(x, y) \in U^2 \mid Weight(x) \cap Weight(y) \neq \emptyset\}$
    $= \Delta \cup \{(P1,P3), (P3,P1), (P2,P4), (P4,P2)\}$
$Height^{-1}(SIM) = \{(x, y) \in U^2 \mid Height(x) \cap Height(y) \neq \emptyset\}$
    $= \Delta \cup \{(P1,P4), (P4,P1), (P3,P4), (P4,P3)\}$

Here $\Delta$ is the diagonal relation of $U$; that is, $\Delta = \{(x, x) \mid x \in U\}$. For example, two
objects are in the relation $Age^{-1}(SIM)$ if and only if their ages are possibly the same.
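The preimage relations of this example can be checked mechanically. The following is a minimal Python sketch, not part of the original paper; the names `TABLE`, `sim`, `preimage`, and `off_diagonal` are hypothetical helpers chosen for illustration, and the interval value sets are encoded as inclusive integer ranges.

```python
# Value sets from the table of Example 1 (Python ranges exclude the upper
# bound, so range(22, 27) encodes {22, ..., 26}).
TABLE = {
    "P1": {"Age": range(22, 27), "Weight": range(48, 55), "Height": range(154, 158)},
    "P2": {"Age": range(26, 34), "Weight": range(73, 79), "Height": range(170, 176)},
    "P3": {"Age": range(24, 30), "Weight": range(51, 59), "Height": range(159, 163)},
    "P4": {"Age": range(31, 38), "Weight": range(75, 83), "Height": range(157, 166)},
}


def sim(w1, w2):
    """SIM: two value sets share at least one value."""
    return bool(set(w1) & set(w2))


def preimage(attribute, relation, table=TABLE):
    """f^{-1}(R) = {(x, y) | f(x) R f(y)} for the map f given by one attribute."""
    return {
        (x, y)
        for x in table
        for y in table
        if relation(table[x][attribute], table[y][attribute])
    }


def off_diagonal(rel):
    """Drop the diagonal pairs (x, x) so the result is easy to compare by eye."""
    return {(x, y) for (x, y) in rel if x != y}


age_sim = preimage("Age", sim)
print(sorted(off_diagonal(age_sim)))
```

Since $SIM$ is reflexive and symmetric, so is each computed relation, in agreement with Lemma 1.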

Next we shall extend the notion of preimage relations in a natural way. Let $A$ $(\subseteq Y^U)$
be a nonempty set of functions. The strong and weak preimage relations of a subset
$B$ $(\subseteq A)$ are defined by

$S_R(B) = \{(x, y) \in U^2 \mid (\forall f \in B)\, f(x) R f(y)\}$;

$W_R(B) = \{(x, y) \in U^2 \mid (\exists f \in B)\, f(x) R f(y)\}$,

respectively. The following properties are clear by the definition of strong and weak
preimage relations.



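Lemma 2. If $B, C \subseteq A$ and $f \in A$, then

(a) $S_R(\{f\}) = W_R(\{f\}) = f^{-1}(R)$;

(b) $S_R(B) = \bigcap \{f^{-1}(R) \mid f \in B\}$ and $W_R(B) = \bigcup \{f^{-1}(R) \mid f \in B\}$;

(c) $S_R(\emptyset) = U \times U$ and $W_R(\emptyset) = \emptyset$;

(d) $S_R(B \cup C) = S_R(B) \cap S_R(C)$ and $W_R(B \cup C) = W_R(B) \cup W_R(C)$;

(e) If $B \subseteq C$, then $S_R(C) \subseteq S_R(B)$ and $W_R(B) \subseteq W_R(C)$;

(f) If $B \neq \emptyset$, then $S_R(B) \subseteq W_R(B)$;

(g) $-S_R(B) = W_{(-R)}(B)$ and $-W_R(B) = S_{(-R)}(B)$. □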

The following obvious proposition shows that strong and weak preimage relations
also inherit many properties from the original relation.

Proposition 1. Let $\emptyset \neq B \subseteq A$. If $R$ is reflexive, then $S_R(B)$ and $W_R(B)$ are
reflexive, and similar conditions apply when $R$ is irreflexive or symmetric. Moreover, if
$R$ is transitive, then $S_R(B)$ is transitive. □



Information relations are preimage relations, as we see in the following example. 



Example 2. Assume $S = (U, A, \{V_a\}_{a \in A})$ is an information system. As before, we
set $V = \bigcup_{a \in A} V_a$ and $Y = \wp(V) - \{\emptyset\}$.

Now we can define the following relations on the set $Y$.

$(W_1, W_2) \in IND \iff W_1 = W_2$;
$(W_1, W_2) \in SIM \iff W_1 \cap W_2 \neq \emptyset$;
$(W_1, W_2) \in FIN \iff W_1 \subseteq W_2$;
$(W_1, W_2) \in BIN \iff W_1 \supseteq W_2$;
$(W_1, W_2) \in DIV \iff W_1 \neq W_2$;
$(W_1, W_2) \in RORT \iff W_1 \subseteq -W_2$;
$(W_1, W_2) \in RNIM \iff W_1 \cap -W_2 \neq \emptyset$;
$(W_1, W_2) \in LNIM \iff -W_1 \cap W_2 \neq \emptyset$.

It can be easily seen that $-IND = DIV$, $-SIM = RORT$, $-FIN = RNIM$, and
$-BIN = LNIM$.

For any subset $B$ $(\subseteq A)$ of attributes,

$ind(B) = S_{IND}(B)$ and $wind(B) = W_{IND}(B)$;
$sim(B) = S_{SIM}(B)$ and $wsim(B) = W_{SIM}(B)$;
$fin(B) = S_{FIN}(B)$ and $wfin(B) = W_{FIN}(B)$;
$bin(B) = S_{BIN}(B)$ and $wbin(B) = W_{BIN}(B)$;
$div(B) = S_{DIV}(B)$ and $wdiv(B) = W_{DIV}(B)$;
$rort(B) = S_{RORT}(B)$ and $wrort(B) = W_{RORT}(B)$;
$rnim(B) = S_{RNIM}(B)$ and $wrnim(B) = W_{RNIM}(B)$;
$lnim(B) = S_{LNIM}(B)$ and $wlnim(B) = W_{LNIM}(B)$.

Because the relation $IND$ is trivially reflexive, symmetric, and transitive, by
Proposition 1 the relation $ind(B) = S_{IND}(B)$ is reflexive, symmetric, and transitive.
Moreover, by Lemma 2(g), $-IND = DIV$ implies that $-ind(B) = -S_{IND}(B) = W_{(-IND)}(B)
= W_{DIV}(B) = wdiv(B)$, for example.

Preimage relations also allow us to define different kinds of information relations, as
we see in the following example.

Example 3. Suppose $S = (U, A, \{V_a\}_{a \in A})$ is an information system, which is given
by the following table.

      Height (cm)    Weight (kg)
P1    186            80
P2    157            59
P3    172            64
P4    166            52

Let us now consider the usual order relation $>$ on $\mathbb{N}$. The preimage relations of $>$ with
respect to the attributes Height and Weight are

$Height^{-1}(>) = \{(P1,P2), (P1,P3), (P1,P4), (P3,P2), (P3,P4), (P4,P2)\}$
$Weight^{-1}(>) = \{(P1,P2), (P1,P3), (P1,P4), (P3,P2), (P3,P4), (P2,P4)\}$.

For all $B \subseteq A$,

$S_>(B) = \{(x, y) \in U^2 \mid (\forall a \in B)\, a(x) > a(y)\}$;

$W_>(B) = \{(x, y) \in U^2 \mid (\exists a \in B)\, a(x) > a(y)\}$.

For example,

$S_>(A) = \{(P1,P2), (P1,P3), (P1,P4), (P3,P2), (P3,P4)\}$ and
$W_>(A) = S_>(A) \cup \{(P2,P4), (P4,P2)\}$.
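The strong and weak preimage relations of this example, together with Lemma 2(c) and 2(g), can be verified with a few lines of Python. This is only an illustrative sketch; the helper names `strong` and `weak` are hypothetical and do not come from the paper.

```python
import operator

# The table of the example with the order relation: single-valued attributes.
TABLE = {
    "P1": {"Height": 186, "Weight": 80},
    "P2": {"Height": 157, "Weight": 59},
    "P3": {"Height": 172, "Weight": 64},
    "P4": {"Height": 166, "Weight": 52},
}


def strong(attrs, relation, table=TABLE):
    """S_R(B): pairs related by R under every attribute in B."""
    return {
        (x, y)
        for x in table
        for y in table
        if all(relation(table[x][a], table[y][a]) for a in attrs)
    }


def weak(attrs, relation, table=TABLE):
    """W_R(B): pairs related by R under at least one attribute in B."""
    return {
        (x, y)
        for x in table
        for y in table
        if any(relation(table[x][a], table[y][a]) for a in attrs)
    }


A = ["Height", "Weight"]
S_gt = strong(A, operator.gt)
W_gt = weak(A, operator.gt)
print(sorted(S_gt))
print(sorted(W_gt - S_gt))
```

Because `all` over an empty collection is true and `any` is false, the empty attribute set yields $U \times U$ and $\emptyset$, matching Lemma 2(c).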

4 Dependence Spaces of Preimage Relations 

Dependence spaces are algebraic structures which are suitable for the study of reducts 
of attribute sets, for example. 

An equivalence relation $\Theta$ on $\wp(A)$ is a congruence on the semilattice $(\wp(A), \cup)$
if for all $X_1, X_2, Y_1, Y_2 \subseteq A$, $X_1 \Theta X_2$ and $Y_1 \Theta Y_2$ imply $X_1 \cup Y_1 \,\Theta\, X_2 \cup Y_2$. The
congruence class of a subset $B$ $(\subseteq A)$ is $B/\Theta = \{C \subseteq A \mid B \Theta C\}$.

If $A$ is a finite nonempty set and $\Theta$ is a congruence on the semilattice $(\wp(A), \cup)$,
then, by the definition of Novotny and Pawlak, the pair $\mathcal{D} = (A, \Theta)$ is called a
dependence space (see e.g. [1,2,3,4]).

Assume $U$ and $Y$ are nonempty sets, let $R \in Rel(Y)$, and let $A$ $(\subseteq Y^U)$ be a finite
subset of functions. Let us now define the following two binary relations $\Theta_R^S$ and $\Theta_R^W$
on the set $\wp(A)$:

$(B, C) \in \Theta_R^S \iff S_R(B) = S_R(C)$;

$(B, C) \in \Theta_R^W \iff W_R(B) = W_R(C)$.

So, two subsets of functions are in the relation $\Theta_R^S$ (resp. $\Theta_R^W$) iff they define the same
strong (resp. weak) preimage relation.

By Lemma 2(d) it is easy to see that the following proposition holds.

Proposition 2. The pairs $(A, \Theta_R^S)$ and $(A, \Theta_R^W)$ are dependence spaces. □

By the previous proposition we can now define in an information system $S =
(U, A, \{V_a\}_{a \in A})$ a dependence space with respect to any information relation presented
in Section 2. For example, the dependence space defined by strong diversity
is $\mathcal{D}_{div} = (A, \Theta_{div})$, where $\Theta_{div} = \{(B, C) \in \wp(A)^2 \mid div(B) = div(C)\}$.

In the theory of information systems the notion of reducts is usually defined only
with respect to the strong indiscernibility relations. In such cases a reduct of a set $B$
of attributes is a minimal subset $C$ of $B$ which defines the same strong indiscernibility
relation as $B$. Here we define reducts in the more abstract setting of dependence spaces.

Let $\mathcal{D} = (A, \Theta)$ be a dependence space and $B \subseteq A$. We say that a subset $C$ $(\subseteq A)$
is a reduct of $B$, if $C \subseteq B$ and $C$ is minimal in $B/\Theta$.

In the framework of dependence spaces we can study reducts of subsets of attributes
in information systems with respect to any type of information relation, as we see in the
next example.

Example 4. Let $S = (U, A, \{V_a\}_{a \in A})$ be an information system. Let us consider the
dependence space $\mathcal{D}_{sim} = (A, \Theta_{sim})$. Two subsets $B$ and $C$ of attributes are now in
the relation $\Theta_{sim}$ if and only if they define the same strong similarity relation; that is,
$sim(B) = sim(C)$. A reduct of a subset $B$ $(\subseteq A)$ of attributes is a minimal subset $C$
of $B$ which defines the same strong similarity relation as $B$.

5 Dense Families and Matrices of Preimage Relations 

Dense families of dependence spaces are families of subsets which contain enough
information about the structure of dependence spaces. In this section we shall show how
we can find reducts of subsets by applying dense families. Moreover, we will study how,
in dependence spaces defined by preimage relations, dense families can be determined
by using matrices of preimage relations.

Suppose $A$ is a set. Then each family $\mathcal{H}$ $(\subseteq \wp(A))$ defines a binary relation $\Gamma(\mathcal{H})$
on $\wp(A)$ as follows.

$(B, C) \in \Gamma(\mathcal{H})$ iff for all $X \in \mathcal{H}$, $B \subseteq X \iff C \subseteq X$.

It is easy to see that $\Gamma(\mathcal{H})$ is a congruence on the semilattice $(\wp(A), \cup)$.

Let $\mathcal{D} = (A, \Theta)$ be a dependence space. We say that a family $\mathcal{H}$ $(\subseteq \wp(A))$ is dense
in $\mathcal{D}$ if $\Gamma(\mathcal{H}) = \Theta$ [4].

Next we shall show how we can find dense families of dependence spaces which 
are defined by strong and weak preimage relations. 

Assume $Y$ is a nonempty set and let $U = \{x_1, \ldots, x_n\}$ be finite. If $R$ is a binary
relation on $Y$ and $A$ $(\subseteq Y^U)$ is a finite subset of functions, then the matrix of preimage
relations of $R$ is an $n \times n$-matrix $M(R) = (c_{ij})_{n \times n}$ such that

$c_{ij} = \{f \in A \mid f(x_i) R f(x_j)\}$

for all $1 \leq i, j \leq n$. Thus, the entry $c_{ij}$ consists of the functions $f \in A$ such that $x_i$ and $x_j$
are in the preimage relation $f^{-1}(R)$ (cf. discernibility matrices defined in [7]).

The following lemma is trivial. 

Lemma 3. If $A$ $(\subseteq Y^U)$ is a finite nonempty set of functions and $M(R) = (c_{ij})_{n \times n}$
is the matrix of preimage relations of $R$ $(\in Rel(Y))$, then for all $B \subseteq A$,

(a) $(x_i, x_j) \in S_R(B)$ iff $B \subseteq c_{ij}$;

(b) $(x_i, x_j) \in W_R(B)$ iff $B \cap c_{ij} \neq \emptyset$. □

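Our next proposition shows how matrices of preimage relations define dense families
of dependence spaces.

Proposition 3. Assume $A$ $(\subseteq Y^U)$ is a finite nonempty set of functions and $M(R) =
(c_{ij})_{n \times n}$ is the matrix of preimage relations of $R$ $(\in Rel(Y))$. Then

(a) $\{c_{ij} \mid 1 \leq i, j \leq n\}$ is dense in the dependence space $(A, \Theta_R^S)$;

(b) $\{-c_{ij} \mid 1 \leq i, j \leq n\}$ is dense in the dependence space $(A, \Theta_R^W)$.

Proof. (a) Let us denote $\mathcal{H} = \{c_{ij} \mid 1 \leq i, j \leq n\}$. We have to show that $\Gamma(\mathcal{H}) = \Theta_R^S$.
If $(B, C) \in \Theta_R^S$, then for all $1 \leq i, j \leq n$, $B \subseteq c_{ij}$ iff $(x_i, x_j) \in S_R(B)$ iff
$(x_i, x_j) \in S_R(C)$ iff $C \subseteq c_{ij}$, which implies $(B, C) \in \Gamma(\mathcal{H})$. Hence, $\Theta_R^S \subseteq \Gamma(\mathcal{H})$.

If $(B, C) \in \Gamma(\mathcal{H})$, then for all $1 \leq i, j \leq n$, $(x_i, x_j) \in S_R(B)$ iff $B \subseteq c_{ij}$ iff $C \subseteq
c_{ij}$ iff $(x_i, x_j) \in S_R(C)$, which implies $S_R(B) = S_R(C)$. Thus, also $\Gamma(\mathcal{H}) \subseteq \Theta_R^S$ and
so $\Gamma(\mathcal{H}) = \Theta_R^S$.

(b) Let us denote $\mathcal{K} = \{-c_{ij} \mid 1 \leq i, j \leq n\}$. If $(B, C) \in \Theta_R^W$, then for all
$1 \leq i, j \leq n$, $B \subseteq -c_{ij}$ iff $B \cap c_{ij} = \emptyset$ iff $(x_i, x_j) \notin W_R(B)$ iff $(x_i, x_j) \notin W_R(C)$ iff
$C \cap c_{ij} = \emptyset$ iff $C \subseteq -c_{ij}$, which implies $(B, C) \in \Gamma(\mathcal{K})$. Hence, $\Theta_R^W \subseteq \Gamma(\mathcal{K})$.

If $(B, C) \in \Gamma(\mathcal{K})$, then for all $1 \leq i, j \leq n$, $(x_i, x_j) \in W_R(B)$ iff $B \cap c_{ij} \neq \emptyset$
iff $B \not\subseteq -c_{ij}$ iff $C \not\subseteq -c_{ij}$ iff $C \cap c_{ij} \neq \emptyset$ iff $(x_i, x_j) \in W_R(C)$, which implies
$W_R(B) = W_R(C)$. So, also $\Gamma(\mathcal{K}) \subseteq \Theta_R^W$ and hence $\Gamma(\mathcal{K}) = \Theta_R^W$. □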

The next proposition, which can be found in [2], characterizes the reducts of a given
subset of attributes by means of dense sets.

Proposition 4. Let $\mathcal{D} = (A, \Theta)$ be a dependence space and let $\mathcal{H}$ $(\subseteq \wp(A))$ be dense
in $\mathcal{D}$. If $B \subseteq A$, then $C$ is a reduct of $B$ iff $C$ is a minimal set which contains an element
from each nonempty difference $B - X$, where $X \in \mathcal{H}$. □

Example 5. Suppose $S = (U, A, \{V_a\}_{a \in A})$ is an information system, which is
described in the following table.

      Age (years)      Weight (kg)      Height (cm)
P1    {22, ..., 26}    {48, ..., 54}    {154, ..., 157}
P2    {26, ..., 33}    {73, ..., 78}    {170, ..., 175}
P3    {24, ..., 29}    {51, ..., 58}    {159, ..., 162}
P4    {31, ..., 37}    {75, ..., 82}    {157, ..., 165}

Let us denote $a$ = Age, $b$ = Weight, and $c$ = Height. If $R = SIM$, then the
preimage matrix of $R$ is the following.

      P1        P2        P3        P4
P1    A         {a}       {a, b}    {c}
P2    {a}       A         {a}       {a, b}
P3    {a, b}    {a}       A         {c}
P4    {c}       {a, b}    {c}       A

By Proposition 3, the family $\mathcal{H} = \{\{a\}, \{c\}, \{a, b\}, A\}$ is dense in the dependence
space $\mathcal{D}_{sim} = (A, \Theta_{sim})$. Next we determine the reducts of the set $A$ in this
dependence space.

The differences $A - X$, where $X \in \mathcal{H}$, are

$A - \{a\} = \{b, c\}$, $A - \{c\} = \{a, b\}$, $A - \{a, b\} = \{c\}$, and $A - A = \emptyset$.

The first three of them are nonempty, and clearly $\{a, c\}$ and $\{b, c\}$ are minimal sets
which contain an element from these differences. Then by Proposition 4, the set $A$ has
the reducts $\{a, c\}$ and $\{b, c\}$ in the dependence space $\mathcal{D}_{sim}$. Thus, $\{a, c\}$ and $\{b, c\}$ are
minimal sets which define the same strong similarity relation as $A$.

Similarly, the family $\mathcal{K} = \{-\{a\}, -\{c\}, -\{a, b\}, -A\} = \{\emptyset, \{c\}, \{a, b\}, \{b, c\}\}$
is dense in the dependence space $\mathcal{D}_{wsim} = (A, \Theta_{wsim})$. The differences $A - X$, where
$X \in \mathcal{K}$, are

$A - \emptyset = A$, $A - \{c\} = \{a, b\}$, $A - \{a, b\} = \{c\}$, and $A - \{b, c\} = \{a\}$.

They are all nonempty and obviously $\{a, c\}$ is the only minimal set which contains an
element from these differences. This means that $\{a, c\}$ is the only reduct of $A$ in the
dependence space $\mathcal{D}_{wsim}$. So, $\{a, c\}$ is the unique minimal set which defines the same
weak similarity relation as $A$.
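The whole computation of this example (matrix entries, dense families, and reducts via Proposition 4) can be reproduced by brute force. The following Python sketch is only an illustration under stated assumptions: the function names `matrix_entries` and `reducts` are hypothetical, and the exhaustive subset enumeration is chosen for clarity, not efficiency; it is not one of the algorithms of [1,2].

```python
from itertools import combinations

# Table of Example 5 with a = Age, b = Weight, c = Height (inclusive ranges).
TABLE = {
    "P1": {"a": range(22, 27), "b": range(48, 55), "c": range(154, 158)},
    "P2": {"a": range(26, 34), "b": range(73, 79), "c": range(170, 176)},
    "P3": {"a": range(24, 30), "b": range(51, 59), "c": range(159, 163)},
    "P4": {"a": range(31, 38), "b": range(75, 83), "c": range(157, 166)},
}
ATTRS = frozenset("abc")


def sim(w1, w2):
    """SIM: two value sets share at least one value."""
    return bool(set(w1) & set(w2))


def matrix_entries(table, attrs, relation):
    """Set of all entries c_ij = {f in A | f(x_i) R f(x_j)} of the matrix M(R)."""
    return {
        frozenset(f for f in attrs if relation(table[x][f], table[y][f]))
        for x in table
        for y in table
    }


def reducts(B, dense_family):
    """Minimal subsets of B meeting every nonempty difference B - X (Prop. 4)."""
    diffs = [B - X for X in dense_family if B - X]
    hitting = [
        frozenset(c)
        for r in range(len(B) + 1)
        for c in combinations(sorted(B), r)
        if all(frozenset(c) & d for d in diffs)
    ]
    # Keep only the hitting sets with no proper subset that is also a hitting set.
    return {c for c in hitting if not any(other < c for other in hitting)}


H = matrix_entries(TABLE, ATTRS, sim)      # dense for strong similarity
K = {ATTRS - entry for entry in H}         # complements: dense for weak similarity
strong_reducts = reducts(ATTRS, H)
weak_reducts = reducts(ATTRS, K)
print(sorted(map(sorted, strong_reducts)), sorted(map(sorted, weak_reducts)))
```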

References 

1. J. Jarvinen, A Representation of Dependence Spaces and Some Basic Algorithms,
Fundamenta Informaticae 29 (1997), 369-382.

2. J. Jarvinen, Representations of Information Systems and Dependence Spaces, and Some
Basic Algorithms, Licentiate's Thesis, Department of Mathematics, University of Turku, April
1997. Available at http://www.utu.fi/~jjarvine/licence.ps.

3. M. Novotny, Applications of Dependence Spaces, In E. Orlowska (ed.), Incomplete
Information: Rough Set Analysis, Physica-Verlag, 1997.

4. M. Novotny, Dependence Spaces of Information Systems, In E. Orlowska (ed.),
Incomplete Information: Rough Set Analysis, Physica-Verlag, 1997.

5. E. Orlowska, Introduction: What You Always Wanted to Know About Rough Sets, In E.
Orlowska (ed.), Incomplete Information: Rough Set Analysis, Physica-Verlag, 1997.

6. E. Orlowska, Studying Incompleteness of Information: A Class of Information Logics, In
K. Kijania-Placek, J. Wolenski (eds.), The Lvov-Warsaw School and Contemporary
Philosophy, Kluwer Academic Publishers, 1997.

7. A. Skowron, C. Rauszer, The discernibility matrices and functions in information systems,
In R. Slowinski (ed.), Intelligent decision support, Handbook of applications and
advances of the rough set theory, Kluwer Academic Publishers, 1991.




Cellular Neural Networks for 
Navigation of a Mobile Robot 



Barbara Siemiatkowska and Artur Dubrawski 

Institute of Fundamental Technological Research, Polish Academy of Sciences 
21 Swietokrzyska Str., 00-049 Warszawa, Poland 



Abstract. This paper summarizes applications of cellular neural net- 
works to autonomous mobile robot navigation tasks, which have been 
developed at the Institute of Fundamental Technological Research. They 
include map building, path planning and self-positioning of an indoor 
robot equipped with a laser range sensor. Efficiency of navigation based 
on cellular neural networks has been experimentally verified in a variety 
of natural, partially structured environments. 



1 Introduction 

Autonomous navigation of a mobile robot deals with the following fundamental
tasks: identifying free space and the objects which compose the robot’s surroundings,
determining the robot’s location within its operating environment,
planning a suitable path that leads from the robot’s current position to some
desired location, and monitoring safe travel along the planned trajectory. During
the past decade or so there has been a vast amount of research conducted in
the field of mobile robotics. As a result, a correspondingly large body of published
material illustrates a variety of approaches taken to solve the fundamental problems of
autonomous navigation. In this paper we present our attempts to provide world
modeling, path planning and self-localization capabilities to an autonomous indoor
mobile robot, by using a methodology of cellular neural networks.

In our experiments we use the RWI Pioneer-1 vehicle equipped with the
HelpMate LightRanger scanning laser range finder (see Fig. 1). The laser device
provides planar scans covering 330° of angular width at a rate of 660 distance
measurements per second. The power level of the laser beam enables measuring
distances up to 10 m with an accuracy of 3 cm.

Subsequent laser scans are aggregated into a grid-based map of the
robot’s environment. The map may be built from scratch, using the series of
immediate range measurements only, or it may be the result of combining the
current readouts with an a priori model of the workspace. In any case, the map
is represented as a two-dimensional, rectangular grid of square cells. Each cell
may be free, occupied by an object (perhaps an obstacle) or its status may be
unknown. The actual status of a given cell depends on the set of current readings
provided by the range sensor and on the prior state. The status update rules are
based on a Bayesian approach to data aggregation.

The path planning module is realized in the form of a two-layer cellular network,
which implements a diffusion (also known as a wave-front) algorithm. The first
layer neurons are first exposed to the external signals provided by the map cells,
which carry the occupancy information. Also, the neuron which corresponds to
the goal location is purposely biased. Then the neuron attached to the robot’s
current position is activated and the signal wave propagates through the network
from there until it reaches the goal neuron. The steady state activation levels are
then propagated to the second layer of the network, which performs a downhill
search for the shortest path linking the goal and the robot. The planned path is
then represented as a sequence of the map cells to follow.




Fig. 1. Mobile robot RWI Pioneer-1 equipped with HelpMate LightRanger scanning
laser range finder.

The relevance of the planned path depends very much on the accuracy and relevance
of the map. The whole process can be distorted very seriously if the actual
locations of the robot at the scan collection points are substantially different
from their estimates. Some way of providing accurate fixes of the current position,
and most importantly the current orientation of the robot, is a necessary
prerequisite of efficient autonomous navigation. Our approach to orientation and
position tracking is well suited to range-based navigation, and it is implemented
in the form of a layered, circular cellular neural network.



2 Cellular neural networks 



The concept of cellular neural networks was introduced by L. O. Chua and L.
Yang in 1988 [1], and since then these structures have found many applications,
especially in signal and image processing [2,3]. A generic cellular neural network
is an n-dimensional array of identical processing units. All interactions between
the cells are local within a finite radius, and the interconnection effect is represented
by the equations (a 2-D case):

$$\dot{x}_{ij} = -x_{ij} + \sum_{k=-r}^{r} \sum_{l=-r}^{r} a_{ij}^{kl}\, y_{i+k,j+l} + \sum_{k=-r}^{r} \sum_{l=-r}^{r} b_{ij}^{kl}\, u_{i+k,j+l} + I \qquad (1)$$

$$y_{ij} = f(x_{ij}) \qquad (2)$$

where $x_{ij}$ is a state, $y_{ij}$ is an output and $u_{ij}$ is an input of a cell $(ij)$, $r$ is
a neighborhood range, $I$ is a scalar constant, $a_{ij}^{kl}$ and $b_{ij}^{kl}$ represent
weights of the interconnection between the cell $(ij)$ and its $(kl)$-th neighbor, and
$f(\cdot)$ is some nonlinear, bounded function. Note that because of the propagation
effect the cells which are not directly connected can still affect each other.

Usually the values of $a_{ij}^{kl}$ and $b_{ij}^{kl}$ are memorized in matrices $A$ and $B$,
and they are symmetrical (i.e. $a_{ij}^{kl} = a_{kl}^{ij}$ and $b_{ij}^{kl} = b_{kl}^{ij}$). The
triple $(A, B, I)$ is called a cloning template. Most networks are spatially invariant,
i.e. all cells in one network have the same cloning template.



3 Map building and path planning 



A grid-based map of the robot’s workspace aggregates prior knowledge about
the location of the objects and geometry of the traversable area with posterior
information in the form of range measurements provided by the on-board sensors.
The aggregation formula follows Bayes’ rule:

$$p_{ij} = P(o \mid ij) = \frac{P(ij \mid o) \cdot P_{ij}(o)}{P(ij \mid o) \cdot P_{ij}(o) + [1 - P(ij \mid o)] \cdot [1 - P_{ij}(o)]} \qquad (3)$$

where $p_{ij}$ is the aggregate probability that the map cell $(ij)$ is occupied, $P(o \mid ij)$
is the posterior probability of the occupancy given the location $(ij)$, $P(ij \mid o)$
is the prior (possibly a result of a previous aggregation), and $P_{ij}(o)$ is the
immediate information provided by the sensor. We assume that all the cells
located on the way of a laser beam measuring some distance $r$ are almost
surely free ($P_{ij}(o) = 0.1$), the cell corresponding to $r$ is almost surely occupied
($P_{ij}(o) = 0.9$), and the cells located behind that one have their immediate status
unknown and are not subject to update.
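The update rule (3) is simple enough to state in a few lines of code. The sketch below is an illustration only, not the authors’ implementation; the function names `update_occupancy` and `update_beam`, and the representation of one beam as a list of cell priors ordered from the sensor outwards, are assumptions made for this example.

```python
P_FREE = 0.1  # sensor evidence for cells the beam passed through
P_HIT = 0.9   # sensor evidence for the cell where the beam ended


def update_occupancy(prior, sensor):
    """Aggregate a cell's prior occupancy with the sensor evidence, Eq. (3).

    Note: the denominator vanishes for the contradictory pairs (0, 1) and
    (1, 0); with evidence values 0.1 and 0.9 this cannot happen unless the
    prior itself is exactly 0 or 1.
    """
    numerator = prior * sensor
    return numerator / (numerator + (1.0 - prior) * (1.0 - sensor))


def update_beam(priors, hit_index):
    """Update the cells along one beam; cells behind the hit stay untouched."""
    updated = list(priors)
    for k in range(hit_index):
        updated[k] = update_occupancy(updated[k], P_FREE)
    updated[hit_index] = update_occupancy(updated[hit_index], P_HIT)
    return updated


# Four initially unknown cells (p = 0.5); the beam ends in the third one.
print(update_beam([0.5, 0.5, 0.5, 0.5], 2))
```

Repeated consistent readings push a cell towards certainty (two hits on an unknown cell give roughly 0.988), while contradictory readings cancel out and return the cell to 0.5.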

For path planning purposes we use a two-layer cellular neural network. The
geometry of both layers is identical and it corresponds to the geometry of the
map. Each cell of the map (or, more practically, of a cut-out of the map) has
its counterpart in the first layer of the planning structure, and there is one
corresponding unit in the second layer. The interlayer connections of the planning
network have one-to-one topology. In both layers the neighborhood range equals
one, i.e. each neuron is interconnected only with its immediate neighbors of the
same layer.

Each unit of the map sends a binary signal $s_{ij}$ to the corresponding neuron
of the first layer of the planning network. A basic scheme of computing the
values of $s_{ij}$ would set it to one if the cell $(ij)$ is free or unknown and to zero
otherwise. However, the map raster dimension is taken smaller than the robot’s size,
so the dimensions of the sensed objects (the obstacles) are expanded a little to
secure a margin for safe travel. The following obstacle expansion function is
then introduced:

_ J 0 ^kieNij < d 

1^ 1 i/ h{pki) > d 

where $h(\cdot)$ is a threshold function (in our case $h(x) = 0$ if $x < 0.4$ and $h(x) = 1$
otherwise), $d$ is half of the largest dimension of the robot (in pixels) and $N_{ij}$
is the set of neurons belonging to the neighborhood of the cell $(ij)$. In this way,
the input to the planning network unit $(ij)$ equals $s_{ij} = 1$ if the map cell $(ij)$
and all of its neighbors are unoccupied, and $s_{ij} = 0$ otherwise.

Fig. 2. Planner’s cloning templates: the first (left) and the second layer (right).

The cloning templates of the first planning layer are depicted in the left part
of Fig. 2. The values of $\alpha$ and $\beta$ are smaller than one and $\alpha > \beta$. The initial
states of the cells are all set to $x_{ij}(0) = 0$, except for the cell corresponding to the
goal location, for which $x_{ij}^{goal}(0) = F$, where $F \gg 1$. The interaction between
the neighboring cells is described by the following discrete-time equations:

$$x_{ij}(t+1) = s_{ij} \cdot \max_{kl \in N_{ij}} \left( a_{ij}^{kl} \cdot y_{kl}(t) \right) + \sum_{kl \in N_{ij}} b_{ij}^{kl} \cdot u_{kl}(t) \qquad (5)$$

$$y_{ij}(t+1) = x_{ij}(t) \qquad (6)$$

The above formulas induce the following properties: 

1 - ^ {ij and t) ^ij{^) — 

2. If the cell {ij) is occupied by an obstacle, then Xij{t) = 0. 

3. If $u_{ij}(t) = const$, then $x_{ij}(t) \leq x_{ij}(t+1)$.

4. The network is stable, ie. \^ij{^) ~ ^ij{t + 1)| < ^ 

The signals propagate until the first layer of the planning network reaches a
stable state:

$$\forall ij \quad x_{ij}(t) = x_{ij}(t+1) \qquad (7)$$

The topology of the second layer is analogous to that of the first layer. Respective
neurons of the first and the second layer are connected one-to-one. Values of the
respective interlayer connection weights are set to one, so the initial activations
are equal to the values of the units’ input signals: $x_{ij}^{(2)}(t) = u_{ij}^{(2)}(t)$, where $x_{ij}^{(2)}(t)$
is the second layer neuron activation at the moment $t$, and $u_{ij}^{(2)}(t)$ the value of
its input signal. Cells of the second layer of the planner have cloning templates
shown in the right part of Fig. 2.

Fig. 3. Stages of path planning.

When the second layer neuron which corresponds to the robot’s location on
the grid-based map receives a signal from its counterpart in the first layer, the
path generation process begins. The path starts at the robot’s position, the
next position is indicated by the neighboring neuron which has the smallest
activation, and so on. The process of path generation continues until the goal
position is reached or until none of the neurons change their activation. The
second condition means that there is no free path to the goal.
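For readers unfamiliar with diffusion planners, the idea can be illustrated with the classic breadth-first wave-front algorithm on a grid. Note that this sketch is a deliberately substituted technique, not the two-layer cellular network of Eqs. (5)-(7): it labels each free cell with its step distance from the goal and then descends from the robot towards smaller labels. The 4-connected neighborhood, the names `wavefront` and `plan`, and the boolean grid `free` (playing the role of $s_{ij}$) are assumptions made for this example.

```python
from collections import deque

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))  # 4-connected grid (assumption)


def wavefront(free, goal):
    """Breadth-first wave from the goal; returns step distances (None = unreached)."""
    rows, cols = len(free), len(free[0])
    dist = [[None] * cols for _ in range(rows)]
    dist[goal[0]][goal[1]] = 0
    queue = deque([goal])
    while queue:
        i, j = queue.popleft()
        for di, dj in NEIGHBORS:
            k, l = i + di, j + dj
            if 0 <= k < rows and 0 <= l < cols and free[k][l] and dist[k][l] is None:
                dist[k][l] = dist[i][j] + 1
                queue.append((k, l))
    return dist


def plan(free, start, goal):
    """Follow decreasing wave labels from start to goal; None if no free path."""
    dist = wavefront(free, goal)
    if dist[start[0]][start[1]] is None:
        return None  # the wave never reached the robot: the goal is unreachable
    rows, cols = len(free), len(free[0])
    path = [start]
    i, j = start
    while (i, j) != goal:
        candidates = [
            (i + di, j + dj)
            for di, dj in NEIGHBORS
            if 0 <= i + di < rows and 0 <= j + dj < cols
            and dist[i + di][j + dj] is not None
        ]
        # A reached cell with label n > 0 always has a neighbor labelled n - 1.
        i, j = min(candidates, key=lambda c: dist[c[0]][c[1]])
        path.append((i, j))
    return path


grid = [
    [1, 1, 1, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
print(plan(grid, (0, 0), (2, 3)))
```

As in the method described above, there are no local minima: every labelled cell has a strictly smaller neighbor, and an unreachable goal is detected immediately because the robot’s cell is never labelled.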

The experimental verification of the feasibility of the presented map building
and path planning method revealed a few of its practical advantages. Unlike
many other techniques of planning trajectories for indoor mobile robots, it does
not suffer from local minima problems. In particular, dead-end corridors are easily
recognized and avoided. Also, the “no way” case, in which the goal is completely
surrounded by obstacles, does not lead to an oscillatory behavior of the robot.
Moreover, the presented method may be mapped onto parallel hardware, which
would provide processing frequencies sufficient for reactive navigation. Figure 3
presents the subsequent stages of map building and path planning. The robot is
depicted as a big black dot, and the goal location is shown as a small dot. First,
the map is built (or updated) using a recent set of laser range finder indications.
The gray pixels represent the map cells corresponding to free space, black
ones represent parts of the obstacles, and the status of the white cells is unknown.
The first picture presents the initial situation; then the map of the environment
is built, the path is planned, and the robot moves along the path while
continuously updating the map. When a new obstacle is detected at a colliding
location, the path may be replanned on the fly.

4 Self-localization

A scanning range sensor collects scans, i.e. sets of $m$ readouts $\{R_i, \varphi_i\}$, where $R_i$
represents an observed distance to an object placed in the way of the laser beam
in the direction determined by a scanning angle $\varphi_i$. The scanning angle takes $m$
discrete values ranging from 0° to 360° (in general; in our case the working range
of the scanning angle is limited by the hardware to 330°), so that $i = 1, \ldots, m$
and $\Delta\varphi = \varphi_{i+1} - \varphi_i = const$, and the indices $i$ are additive modulo $m$.

Fig. 4. Readout $\{R_i, \varphi_i\}$ is a candidate member of a segment of five collinear
readouts (neighborhood $n = 2$) oriented along a direction $\alpha$ (left). Cyclic cellular
neural network in a setup suitable for processing laser range scans in order to
obtain segment orientation histograms (right).

We may write a normal equation of a line in the plane $(x, y)$ oriented along the
direction $\alpha$ as follows: $x \cdot \sin\alpha - y \cdot \cos\alpha + c = 0$, where $|c|$ determines the distance
between the line and the origin. By taking into account that $x_i = R_i \cdot \cos\varphi_i$ and
$y_i = R_i \cdot \sin\varphi_i$, we may determine the distance $c_i$ for the line which crosses a
given point $(x_i, y_i)$:

$$c_i = R_i \cdot \sin(\alpha - \varphi_i) \qquad (8)$$

Two points $\{R_i, \varphi_i\}$ and $\{R_j, \varphi_j\}$ belong to the same line oriented along $\alpha$ if
$c_i = c_j$. This is the essential idea for testing collinearity of the neighboring range
readouts in the presented method. We may say that the readout $\{R_i, \varphi_i\}$ belongs
to a line segment oriented along a specified direction $\alpha$ (see left part in Fig. 4)
if it and its $2n$ neighbors fulfill the following condition:

$$\forall_{\,i-n \leq k \leq i+n} \quad \left| c_k - \frac{1}{2n+1} \sum_{t=i-n}^{i+n} c_t \right| < \epsilon \qquad (9)$$

So, for the collinearity check, it is sufficient to compare the neighboring
$c$-values computed for a particular $\alpha$ using the set of current readouts $\{R_i, \varphi_i\}$.

In the neural implementation of the described stage of processing (right part
of Fig. 4), the first ring-shaped input layer is composed of processing units which
model equation (8). Each of the $m$ units receives a separate input signal $R_i$
and a common input $\alpha$. The $\varphi_i$’s are hardware dependent and may be assigned
constant values, one for each of the input layer’s units. The second layer is composed
of $m$ laterally interconnected neurons which receive signals proportional
to the $c_i$’s computed by the units of the input layer. The neurons take into account
signals received from their neighbors (of course, in general the range of lateral
connections may be larger than $n = 1$) and check condition (9). A second
layer neuron returns 1 if its collinearity condition is met and 0 otherwise. Consecutive
1’s in the ring-shaped output vector from the second layer of processing
units represent wall segments oriented along the specified $\alpha$ with the accuracy
specified by $\epsilon$. Zeros represent the components of the given scan which do not
seem to be members of line segments parallel to the direction $\alpha$.





Fig. 5. Left: histograms $N(\alpha)$ with $n = 1$ and $\epsilon$ set to subsequently 20, 10, 5
and 3 millimeters. Right: normalized cross-correlations of a pair of histograms
computed for two consecutive scans taken en route; the solid line is the result
obtained using the method described in this paper with $n = 2$ and $\epsilon = 10$ mm, the
dashed line was obtained with angle histograms implemented after [5] with $n = 7$.

We compute sets of $c$ values for the whole range of possible line segment
orientations, i.e. from 0 to +180 degrees in the robot’s coordinates. For each $\alpha$ from
that range we sum up the 1’s present on the outputs of the second layer of neurons,
and then we plot these sums, denoted $N(\alpha)$, against the $\alpha$’s. In this way we obtain
a sort of angle histogram, which actually represents the cumulative dimensions
of scene segments oriented in particular directions as seen by the sensor. The
histograms, shown in Fig. 5 for $n = 1$ and several values of $\epsilon$, position their
extrema at the locations of the most and the least predominant directions of the
scene segments. In a typical structured or partially structured indoor environment
there are usually two maxima, which correspond to (most often) roughly
perpendicular directions.

If we compared two histograms computed for scans taken at nearby locations,
we could expect them to reveal changes in position and orientation of the
viewpoint with respect to the elements of the scene. The difference in the $\alpha$’s which
correspond to the maxima of the histograms would be roughly equal to the robot’s
orientation change if the scans were collected sufficiently often. However,
computing the maximum of a normalized cross-correlation of the two
histograms is a much more robust way of tracking orientation changes (the location
of the maximum of the cross-correlation curve indicates an orientation change of
-2° for a sample pair of histograms, as shown in the right part of Fig. 5). It is
possible to track changes of the robot’s position in the same framework. Due to
space limits we have to skip a discussion of that topic here, but an interested
reader will find more in [4].
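The orientation tracking step can be sketched as follows. This is an illustrative Python example under stated assumptions rather than the authors’ code: the histograms are treated as circular (line orientations repeat every 180°), both histograms are sampled on the same angular grid, and the names `normalized_crosscorrelation` and `orientation_shift` are hypothetical.

```python
import math


def normalized_crosscorrelation(h1, h2):
    """Circular normalized cross-correlation of two equally long histograms.

    Assumes neither histogram is constant (otherwise the norm is zero).
    """
    m = len(h1)
    mean1, mean2 = sum(h1) / m, sum(h2) / m
    d1 = [v - mean1 for v in h1]
    d2 = [v - mean2 for v in h2]
    norm = math.sqrt(sum(v * v for v in d1) * sum(v * v for v in d2))
    return [
        sum(d1[i] * d2[(i + s) % m] for i in range(m)) / norm
        for s in range(m)
    ]


def orientation_shift(h1, h2):
    """Shift (in histogram bins) at which the cross-correlation is maximal."""
    cc = normalized_crosscorrelation(h1, h2)
    return max(range(len(cc)), key=cc.__getitem__)


# A toy histogram with two peaks and a copy of it rotated by three bins.
before = [0, 1, 5, 1, 0, 0, 2, 0, 0, 0]
after = before[-3:] + before[:-3]
print(orientation_shift(before, after))
```

The bin shift is converted to an angle by multiplying it by the angular step of the histogram; shifts larger than half the number of bins correspond to negative rotations.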

Reference [5] presents a technique of tracking orientation and position
called angle histograms. In that method the histograms reflect local gradients of
orientation of the lines crossing the subsequent components of the scans. Our
method instead reflects the cumulative lengths of collinear segments which are present in
the scan. The experiments conducted in realistic indoor surroundings,
composed of wall segments made of various materials, painted in various
colors and of various surface textures and geometries, show that our technique is
substantially more robust against measurement noise than the method described
in [5] (Fig. 5).

5 Conclusion 

In this paper we presented cellular neural network based approaches to map 
building, path planning and self-localization of a mobile robot working in par- 
tially structured environments. The presented methods are especially suitable 
for use with optical scanning range finding devices. 

Experiments performed with a real robotic vehicle in a realistic indoor
environment demonstrated the computational efficiency of the discussed methods.
The presented self-localization technique may serve as an add-on to existing
autonomous navigation systems. It can provide high accuracy and efficiency of
on-line map building and navigation in known or unknown, partially structured
environments. The path planning method has some important advantages too. In
the case of a parallel realization it would provide processing frequencies in
the range typical of reactive control schemes. Moreover, it does not suffer
from local minima, and it does not fail when the goal is unreachable.

Abstract - This paper presents the TS model identification method by which a great number of
systems whose parameters vary dramatically with working states can be identified via Fuzzy 
Neural Networks (FNN). The suggested method could overcome the drawbacks of traditional 
linear system identification methods which are only effective under certain narrow working 
states and provide global dynamic description based on which further control of such systems 
may be carried out. Simulation results of a second-order parameter varying system 
demonstrate the effectiveness of the method. 

Keywords - Parameter Varying Systems, TS Fuzzy Model, Fuzzy Neural Networks (FNN), 

1 Introduction

Controlled systems whose parameters vary dramatically with working states, namely
parameter varying systems, are widely encountered in practical industrial situations.
Although traditional linear system identification methods have been well established in the
last twenty years, they can only be used under a certain narrow range of working conditions.
Moreover, traditional controllers based on such models cannot cope effectively with changes
in the process dynamics. Therefore, developing a global dynamic model and
establishing the corresponding control schemes for parameter varying systems are
desirable.

Takagi and Sugeno [2,3,4] proposed a new type of fuzzy model (TS model) which has 
been widely used. The model provides succinct description of complex systems, and is 
convenient for designing controllers. Recently, the authors [5] suggested an identification 
method of TS fuzzy model for nonlinear systems via Fuzzy Neural Networks (FNN). It is 
very effective in describing systems. In this paper, the TS fuzzy model is generalized to 
the parameter varying systems, and an identification method based on FNN is presented. 
Simulation results of a second-order system illustrate and verify the effectiveness of the 
method. 

2. TS Fuzzy Model

Parameter varying systems which possess m working state characteristic variables, q
inputs and a single output can be described by a TS fuzzy model consisting of R rules,
where the i-th rule can be represented as:

\[ R^i:\ \text{if } z_1 \text{ is } A_1^{k_1},\ z_2 \text{ is } A_2^{k_2},\ \ldots,\ \text{and } z_m \text{ is } A_m^{k_m},\ \text{then } y^i = a_1^i x_1 + a_2^i x_2 + \cdots + a_q^i x_q \qquad (1) \]

\[ i = 1,2,\ldots,R; \quad k_j = 1,2,\ldots,r_j \]

where R is the number of rules in the TS fuzzy model, $z_j$ ($j = 1,2,\ldots,m$) is the j-th
characteristic variable of the working state of the system and can be selected as an input,
an output or another parameter of the system, $x_l$ ($l = 1,2,\ldots,q$) is the l-th model input,
and $y^i$ is the output of the i-th rule. For the i-th rule, $A_j^{k_j}$ is the $k_j$-th fuzzy subset
of the domain of $z_j$, $a_l^i$ is the coefficient of the consequent, and $r_j$ is the fuzzy partition
number of the domain of $z_j$. The rule says that if the variables $z_j$ of the working state stay
in the domains $A_j^{k_j}$ as described, then the output $y^i$ and the inputs $x_l$ ($l = 1,2,\ldots,q$)
are related by (1). For simplicity, we write $r_j = r$; r is determined by the complexity and
accuracy of the model. Once a set of working state variables $(z_{10}, z_{20}, \ldots, z_{m0})$ and
model input variables $(x_{10}, x_{20}, \ldots, x_{q0})$ is available, the output of the TS model can
be calculated as the weighted average of the $y^i$:

\[ y = \frac{\sum_{i=1}^{R} \mu^i y^i}{\sum_{i=1}^{R} \mu^i} \qquad (2) \]

where $y^i$ is determined by the consequent equation of the i-th rule. The truth-value $\mu^i$ of the
i-th rule can be calculated as [1, p. 382]:

\[ \mu^i = \bigwedge_{j=1}^{m} A_j^{k_j}(z_{j0}) \qquad (3) \]

Equation (2) can be rewritten as:

\[ y = \sum_{l=1}^{q} \left( \frac{\sum_{i=1}^{R} \mu^i a_l^i}{\sum_{i=1}^{R} \mu^i} \right) x_l \qquad (4) \]

From (4), one can see that the TS fuzzy model can be expressed as an ordinary linear
equation. As $\mu^i$ varies with the working state, the TS fuzzy model is a coefficient-varying linear
equation. Over all possible ranges of the working states, the TS fuzzy model reflects the
relationships between the model parameters and the working states. In this way the global
dynamic characteristics of the parameter varying system are represented.
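
As a small illustration of the weighted average in (2) and of the coefficient-varying linear form in (4), the following sketch evaluates both for given truth-values. The function names and the numerical values are hypothetical; the truth-values are simply passed in rather than computed from membership functions.

```python
import numpy as np

def ts_output(mu, a, x):
    """Weighted-average TS output, as in Eq. (2).

    mu: (R,) truth-values of the rules
    a:  (R, q) consequent coefficients, a[i, l] multiplies x[l] in rule i
    x:  (q,) model inputs
    """
    mu, a, x = (np.asarray(v, dtype=float) for v in (mu, a, x))
    y_rules = a @ x                      # rule outputs y^i, Eq. (1)
    return float(mu @ y_rules / mu.sum())

def ts_coefficients(mu, a):
    """Effective linear coefficients of Eq. (4) for the current truth-values."""
    mu = np.asarray(mu, dtype=float)
    return (mu / mu.sum()) @ np.asarray(a, dtype=float)

mu = [0.2, 0.8]                 # illustrative truth-values
a = [[1.0, 0.0], [0.0, 1.0]]    # illustrative consequent coefficients
x = [2.0, 3.0]
y = ts_output(mu, a, x)
coeffs = ts_coefficients(mu, a)
```

Both routes give the same output, which is the point of (4): for a fixed working state the model is an ordinary linear equation whose coefficients are `coeffs`.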



3. Fuzzy Neural Networks TS Fuzzy Model Identification Method 

A. Structure of the FNN 

According to (1)~(3), the structure of the FNN presented here consists of premise,
consequent and fuzzy inference parts. For systems which possess m working state
characteristic variables, q inputs and a single output, the FNN used for the TS model
identification is shown in Fig. 1. The circles and the squares in the figure represent the
units of the networks. The notations between the units denote the connection weights. The
units without any notation just deliver the signals from input to output.

1) Normalization of the working state variables 

Layers (A)~(B) of the FNN are used to normalize the working state variables in case of 
saturation of the premise nodes. Assuming P samples = 1,2,--*,P) are 

available for training the networks, the j-th working state variable of the p-th sample can 
be normalized as: 

where zj is the normalized working state variable of z J ; (wj and (wj are the 
coefficients and biases of normalization respectively: 

[^s)j= 

max 
max 







(z;)-min(z;) 

(z;) + min(z;) 
2 



2) Premise 

The premise part of the FNN includes Layers (C)-(F), which are used for fuzzy partition
and truth-value calculations. The symbol 'Σ' in layer (D), which is the sum node, realizes
the following operations for the k-th fuzzy subset of $z_j$:

The symbol '∧' in layer (F) is the fuzzy minimum node, and the input-output relationships
for the i-th rule can be written as:




m 

= min 

j=\ and A:=^(/j) 




( 7 ) 




= z = 1,2,---,7?.7 = l,2,---,w. l,2,---,r 

where and are input and output of the nodes which correspond to the k-th fuzzy 
subset of z. in layer (•) respectively; ^ and are input and output of the nodes which 
correspond to the i-th rule in layer (•) respectively; the central point and gradient of the k- 
th fuzzy subset for z. are determined by both and (wj represents the 

connective relationship between the i-th rule and the k-th fuzzy subset of z.. The 

membership functions of the working state variables are determined by activation 
functions of the nodes in layer (E). In this paper, the following activation functions are 
taken: 

+ , k=\ 

f^{x) = ^ , ^ = 2,3,---,r-l (9) 

l/(l + e'') , k = r 

which realize fuzzy partition as shown in Fig. 2. 

3) Consequent and fuzzy inference 

Layers (G)-(J), which are used to implement the linear equations of the TS fuzzy 
model, are consequent parts of the FNN. As for the i-th rule of the consequent, input- 
output relation realized can be written as: 

dy=ijyW^) Xj ( 10 ) 

J=1 

where (w^) . . is the coefficient of x. in rule i . 

Layers (K)~(M) realize the fuzzy inference as shown in (2).

B. Learning algorithm

The parameters $(w_c)_{j,k}$ and $(w_g)_{j,k}$, which determine the central points and gradients of the
membership functions in the premise, and $(w_a)_{i,j}$, which determines the local linear
relationships of the consequent, need to be learnt by the FNN. Assume that P samples
$(z_1^p, z_2^p, \ldots, z_m^p, x_1^p, x_2^p, \ldots, x_q^p)$ ($p = 1,2,\ldots,P$) are available for training the FNN and that the
corresponding teacher signal is $t^p$. Once the p-th sample is put on the networks, the actual
output $y^p$ of the networks can be obtained. Thus, the learning error function of the sample
can be defined as:

\[ E^p = \frac{1}{2}\left(t^p - y^p\right)^2 \qquad (11) \]

Under this definition, the total error function of all the samples can be written as:

\[ E = \sum_{p=1}^{P} E^p = \frac{1}{2}\sum_{p=1}^{P}\left(t^p - y^p\right)^2 \qquad (12) \]

According to the Gradient-Descent learning algorithm, one can obtain: 

In order to solve j in (14) , the following equivalent transition for (7) is

needed: 



min 0%^= y 

J=Unik=^(ij) ’ 






/=land /^7 



(15) 



where: 

Therefore, dd'p^ j dd'^l^ can be calculated by 



1 , d^l <o!f 

’ J,k l,k 

0 6>‘^’ > 



(16) 



gQ{E)p VVAdOj,k ^l,k ) L ^(E)p^^(E)f 

^j,k /=land/^/ P ’ >^!,k 

Moreover, j can be obtained from (10) as follows 



(17) 



dl%^” 

J,k 



^^j,k ■ ^j,k 



0\ ^ 
J,r 



, k = 2,3,---,r-l (18) 



k = r 



■(‘-oT') 

(14), (17) and (18), <®/<?(«'g) 

Using the same method mentioned above, dEj d{w ^ ) ^ can also be represented by 



From 



can be obtained. 



dE 



■=z 



dEP dyP 



ddc)i,k 



dEP 



dyP 



z , 

i=\mdk=(l}{i,j)Y_ 






) z , 

i=\dLndk=(l}{i,j)y_ 



dyP_ 

(ofdE 



oofdp ' dof)p 



dlfk^ 

ddp)p 



IS- 



(19) 



p=\ 

p 

=z 

77=1 
P 

=-z 

p=\ 

X)f/ ] 

”'1 

Therefore, the final tuning equations of the premise and consequent parameters of the 
FNN can be written as: 



{^a )y (« + !) = {w „), . («) - ^ • dEjd[w^ )^. , (20) 

(m.J (« + !) = («) - ^ • (21)

(« + !) = ) .^ («) - ^ • dEj J (22) 

where ^ is the training times; ^ and ^ are learning rates. In this paper, we use the 
adaptive back-propagation algorithm suggested by the authors [6]. 
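
To make the gradient-descent tuning of the consequent parameters concrete, here is a minimal sketch under stated assumptions. It uses a plain fixed learning rate instead of the adaptive back-propagation of [6], it treats the rule truth-values as given, and the names `forward`, `consequent_grad`, `sgd_step` and the symbol `t` for the teacher signal are illustrative only. For the squared error of one sample, the gradient with respect to the coefficient of input l in rule i is -(t - y) times the normalized truth-value of rule i times x_l, which follows from the weighted-average output (2).

```python
import numpy as np

def forward(mu, a, x):
    """TS output: normalized truth-values times rule outputs."""
    w = mu / mu.sum()
    return float(w @ (a @ x))

def consequent_grad(mu, a, x, t):
    """Gradient of E = 0.5 * (t - y)^2 with respect to a (shape (R, q))."""
    w = mu / mu.sum()
    y = float(w @ (a @ x))
    return -(t - y) * np.outer(w, x)

def sgd_step(mu, a, x, t, lr=0.05):
    """One fixed-rate gradient-descent update of the consequent coefficients."""
    return a - lr * consequent_grad(mu, a, x, t)

mu = np.array([0.3, 0.7])       # illustrative truth-values
a = np.zeros((2, 3))            # illustrative initial coefficients
x = np.array([1.0, 2.0, 3.0])
t = 1.0                         # teacher signal for this sample
a_new = sgd_step(mu, a, x, t)
err_before = 0.5 * (t - forward(mu, a, x)) ** 2
err_after = 0.5 * (t - forward(mu, a_new, x)) ** 2
```

The premise parameters would be updated in the same way, with the derivative passing through the minimum node and the membership functions.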



4. Simulation Example 



Consider the following second-order parameter varying system:

\[ \frac{y(s)}{u(s)} = \frac{1}{(1 + Ts)^2} \qquad (23) \]

where the time constant T is affected by a working state variable z ($z \in [0.3, 0.9]$).
Suppose the relationship between them is:

\[ T = 20 + 20 \cdot (z - 0.3) \qquad (24) \]

Once the sample time $T_s$ is given, the discrete time description of the system can be
obtained:

\[ y(k) = \frac{2\,(T^2 + T\,T_s)\,y(k-1) - T^2\,y(k-2) + T_s^2\,u(k-1)}{(T + T_s)^2} \]

In this paper, the sample time $T_s$ is taken as 5 seconds. Curves 1, 2 and 3 in Fig. 3 show
the unit step response of the system at z = 0.3, z = 0.6 and z = 0.9 respectively, and one can
see that the variations between the different z's are very large.
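
The step responses can be reproduced approximately with a short simulation of the difference equation above. This is a sketch, not the authors' code: it takes (24) as read from the text, uses the 5-second sample time, assumes zero initial conditions and a unit step applied at k = 0, and the function name `step_response` is hypothetical.

```python
def step_response(z, n_steps=300, Ts=5.0):
    """Unit step response of the discretized second-order system.

    T depends on the working state z as in Eq. (24).
    """
    T = 20.0 + 20.0 * (z - 0.3)
    y1 = y2 = 0.0                      # y(k-1), y(k-2), zero initial state
    out = []
    for k in range(n_steps):
        u_prev = 1.0 if k >= 1 else 0.0    # u(k-1) for a step applied at k = 0
        yk = (2.0 * (T * T + T * Ts) * y1 - T * T * y2
              + Ts * Ts * u_prev) / (T + Ts) ** 2
        out.append(yk)
        y2, y1 = y1, yk
    return out

responses = {z: step_response(z) for z in (0.3, 0.6, 0.9)}
```

Each response settles at 1 because the coefficients of the difference equation sum to a unit steady-state gain, while a larger z gives a larger T and therefore a slower rise.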

Using the suggested FNN TS model identification method, we select z as the working
state variable for the input of the premise in the FNN and take u(k - 1), y(k - 1) and
y(k - 2) as input variables for the TS model. The aim of the identification is to obtain a
global model which is suitable for all the possible working states of the system. First, ten
states are selected randomly, and 310 groups of training data are obtained by applying
5th-order M-sequences with a range of 1 to the system. All of the weights in the consequent
of the FNN are selected randomly between -0.1 and 0.1, and the fuzzy partition number r is
set to 7 as shown in Fig. 2. In order to speed up the convergence of the networks,
the following parameters are used as the initial values of the adaptive BP algorithm
described in [6]:

C(l) = 0.9, ^(1) = 0.4 , a, = 1.4 , = 0.6, = 0.5 

The final convergence conditions are taken as: 

1) the number of the samples which have satisfied -y^^jf < 0.05 has exceeded 95 
percent of the total samples. 

2) The number of training iterations has exceeded the specified maximum of 10000.

After training the FNN 868 times, the networks converged by satisfying condition 1), and
the final simulation results are shown in Fig. 4 and Fig. 5. As shown in Fig. 4, where the
solid line and the dotted line denote the expected output of the system and the actual
output of the networks respectively, the network output describes the actual output of the
system well for most of the samples. Finally, we use another ten groups of z to verify
the performance of the resulting FNN TS fuzzy model, and the results are shown in Fig. 6.
The same conclusion can be drawn from them. Therefore, the suggested TS model
identification method is effective in obtaining the global dynamic model of
parameter varying systems.



5. Conclusions 

This paper generalizes the TS model to parameter varying systems and presents the
corresponding identification method via FNN. The proposed method can effectively
realize the identification of parameter varying systems, whereas the traditional linear
system identification methods cannot. Furthermore, control of such systems based on the
well-established TS fuzzy model can be carried out, which opens a field for further
research. Simulation results of a second-order parameter varying system have verified
the effectiveness of this method. It should be noted that the method provides a more
effective way to establish fuzzy control rules for multi-working-state situations. Based on
the model, the performance of a fuzzy controller can be improved in such situations.

Abstract. It is important to design intelligent welding robots to obtain a good
welding quality. It is required to detect the bead height and the deviation from
the center of the gap, and to control the arc length and the back bead by adjusting
welding conditions such as welding speed, power source voltage and wire feed
rate. The authors propose an arc sensor using neural networks to detect the arc
length and the wire extension. By using them and by means of a geometric method,
the bead height is detected. Moreover, the authors propose a switch back welding
method to get a stable back bead. That is, the welding torch is not only woven in
the groove, but also moved backward and forward.



1 Introduction 

The arc welding process is one of the important modern technologies for joining
metal plates. It is important to design intelligent welding robots so as to obtain a
good welding quality. It is also important to control the arc length regardless of
disturbances such as variations of the torch height, the feed rate and so on. A
method is discussed to control the arc length by adjusting the power source
characteristic. If the groove gap is narrow, the welding speed is faster and the
current is bigger than when the groove gap is wide. In that case the power source
characteristic of constant voltage is used. If the gap becomes wide, a pulsed
current is applied to the welding, and it is produced by using the power source
with a rising characteristic.

The authors propose the switch back welding method to obtain a stable back bead.
The welding torch is not only woven at 10 Hz in the groove, but also moved
backward and forward. Therefore it is required to detect the groove gap. During
the forward process, the edges of the groove gap are melted and the droplet metal
is deposited on the edges, so that the weaving width of the torch is equal to the
gap. During the backward process, the deposited metals bridge the base metals.

If the weaving center of the torch is exactly at the center of the groove gap, the
waveforms of the voltage and the current on the right side and the left side are
symmetric about the center of the gap. The sums of the voltage or current on the
right side of the weaving are equal to those on the left side. If there is a
difference between the sums, the weaving center is not at the center of the gap.
In the conventional arc sensor, the difference of the sums between the right and
left sides is used to detect the deviation from the center of the gap. It is
difficult to detect the gap itself with the conventional arc sensor.

In order to find the gap, authors propose neural network (N.N.) arc sensor which 
uses the sampled data of both current and voltage at the current pick-up point. That is, 
the extension of an electrode wire and the arc length are found by the N.N. The 
groove shape is estimated by outputs of the N.N. The training data are constructed 
from the experimental results. The performance of the N.N. is examined by using the 
testing data. 



2 A proposal of switch back method to get a stable back bead 

By using the knowledge of weld pool phenomena, the authors have developed a new
welding method of controlling the torch motion. Figure 1 shows the switch back
welding method. The weaving width depends on the gap. If the root gap is 4 mm, the
torch is moved forward 18 mm at 90 cm/min and backward 9.5 mm at 22 cm/min
in one cycle. This motion is repeated, and in this example the average welding speed is
13.4 cm/min. During the forward process, the amplitude of the 10 Hz torch weaving is
controlled to be equal to the root gap, so that the heat input and the droplet are given to
each root edge of the base metals. The pulse peak is produced after passing the weaving
center, to transfer the droplet to each root edge of both base metals in the case of a
relatively wide root gap. In Fig. 2, the top of the electrode wire is near the root edge of
the groove, where the current is synchronized with the pulse duration. During the backward
process the back weld pool tends to become big enough to get a good back bead.
Before the molten metal burns through, the torch should begin to move forward, away
from the weld pool. The welding speed in the forward process is faster than that in the
backward process.
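
The quoted average welding speed can be checked from the stroke lengths and speeds of one cycle. The short calculation below is only an arithmetic check; the function name is hypothetical.

```python
def average_speed_cm_per_min(fwd_mm, fwd_cm_min, back_mm, back_cm_min):
    """Net advance per cycle divided by the duration of one cycle."""
    fwd_time = (fwd_mm / 10.0) / fwd_cm_min      # minutes spent moving forward
    back_time = (back_mm / 10.0) / back_cm_min   # minutes spent moving backward
    net_cm = (fwd_mm - back_mm) / 10.0           # net advance in cm
    return net_cm / (fwd_time + back_time)

# Values for a 4 mm root gap, as given in the text.
v_avg = average_speed_cm_per_min(18.0, 90.0, 9.5, 22.0)
```

The net advance is 8.5 mm per cycle and one cycle takes about 3.8 s, which gives roughly 13.45 cm/min, in agreement with the stated 13.4 cm/min.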



Fig. 1. Switch-back welding method.

Fig. 2. Relationship between voltage, torch position and pulse duration.

3 Estimation of the arc length and the wire extension length by 
the neural network 

The voltage at the current pick-up point (the torch voltage) consists of the arc
voltage and the voltage drop in the wire extension. The arc voltage may be
experimentally determined from the arc length and the current. The voltage drop in
the wire extension may be determined by the resistance and inductance at each part
of the extension and by its length, which depends on the wire feed rate, the melting
rate at the top of the wire, and the temperature or current given up to that time. In a
discrete time system, the phenomena (the current and the voltage) at the present
sampling instant may be described by difference equations concerning the current,
the voltage, the arc length, the wire extension and the wire feed rate at previous
sampling instants. We tried to construct the N.N. for obtaining the arc length l(k)
and the wire extension length L(k) at the present sampling instant kT (T: sampling
period, 1 ms) from the current i(k), ..., i(k-49), the voltage v(k), ..., v(k-49) and the
wire feed rate v_f(k), ..., v_f(k-49) at the present and previous sampling instants, 50
in total, covering the past 50 ms, as illustrated in Fig. 3. The number of units in the
input layer is 150. The number of units in the hidden layer is 15. We trained the
N.N. shown in Fig. 4 by means of the back propagation method, using the training
data obtained in experiments, of which one example is shown in Fig. 5.
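
The 150-unit input layer corresponds to three signals with 50 samples each. The sketch below shows one way such an input vector and a 150-15-2 forward pass could be assembled. It rests on assumptions not stated in the text: the ordering of the samples inside the vector, the tanh hidden activation, the linear output layer and the random weights are all illustrative, and `build_input` and `forward` are hypothetical names.

```python
import numpy as np

WINDOW = 50  # samples per signal: 50 ms at a 1 ms sampling period

def build_input(i, v, vf, k, window=WINDOW):
    """Stack the last `window` samples of current, voltage and wire feed rate."""
    if k < window - 1:
        raise ValueError("not enough past samples for a full window")
    sl = slice(k - window + 1, k + 1)       # samples k-49 .. k
    return np.concatenate([i[sl], v[sl], vf[sl]])

def forward(x, W1, b1, W2, b2):
    """150-15-2 network: outputs [arc length, wire extension length]."""
    h = np.tanh(W1 @ x + b1)                # assumed hidden activation
    return W2 @ h + b2                      # assumed linear output layer

rng = np.random.default_rng(0)
i_sig = np.arange(100.0)                    # placeholder signals
v_sig = i_sig + 1000.0
vf_sig = i_sig + 2000.0
x = build_input(i_sig, v_sig, vf_sig, k=49)
W1, b1 = rng.normal(size=(15, 150)) * 0.01, np.zeros(15)
W2, b2 = rng.normal(size=(2, 15)) * 0.01, np.zeros(2)
out = forward(x, W1, b1, W2, b2)
```

In practice the signals would be scaled before being fed to the network, and the weights would come from back propagation training on the experimental data rather than from a random generator.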

Fig. 3. Relationship between sampling instant of current and voltage,
and torch position.

Fig. 4. Neural network to estimate arc length and wire extension length

In Fig. 5, the arc length and the wire extension length are measured from the
motion picture taken by a high speed video CCD camera. Figure 5 shows the
experimental results in the case where the root gap is 4 mm and the center of torch
weaving agrees with the center of the groove. Figures 5(a) and 5(b) show the
experimental results in the forward process at (a) in Fig. 1 and in the backward
process at (b) in Fig. 1, respectively.

While the torch weaves in the groove, the arc discharges between the top of the
electrode wire and the base metal or weld pool surface through the shortest path.
However, when the temperature of the base metals is low and the gap between both
base metals is wider than 3 mm, the arc continues to form on one side of the base metal
until the top of the wire comes close to the opposite base metal, as illustrated in Fig. 6.

In the forward process , the neural network ( arc sensor) detects the torch position 
in the groove to perform the seam tracking and the N.N. also detects the groove gap to 
determine the weaving width of torch. 

In the backward process, the N.N. measures the bead height in the groove to in- 
vestigate whether both edges of the base metals bridge or not with the deposited metal 
and to control the welding conditions such as the current waveform, wire feed rate, 
power source voltage and welding speed. 

The outputs of the trained N.N. are also shown in Fig. 5. Good estimates of the
arc length and the wire extension are obtained for both the training data and the
testing data.



4 Estimation of the gap and deviation from the groove gap center 

During the forward process, the arc discharges with a right angle from the base metal 
surface. The angle of the groove is 60 degrees. The point on the surface of the base 
metal, from which the arc discharges, is estimated from the arc length and the exten- 
sion. 

Let the origin of coordinate x be the center of the torch weaving, and the origin of
coordinate y be the end of the wire. First, the coordinate of the top of the wire is found
from the torch position calculated from encoder pulses and the extension found by the
N.N. Secondly, a circle is drawn with radius $l_k$ corresponding to the arc length. The
surface of the groove corresponds to the tangent line of the circle. If the arc
discharges to the left side of the groove, the coordinate (x, y) of the tangent point of the
arc on the left side is calculated by using the following equation.

\[ x = tp - l_k \cos 30^\circ, \qquad y = L_k + l_k \sin 30^\circ \qquad (1) \]

where tp is the torch position.

On the other hand, if the arc discharges to the right side of the groove, the
coordinate (x, y) of the tangent point of the arc on the right side is calculated by using
the following equation.

\[ x = tp + l_k \cos 30^\circ, \qquad y = L_k + l_k \sin 30^\circ \qquad (2) \]

The groove shape is found by using estimated points. Since the length from the bot- 
tom of the base metals to the torch is 20mm, the coordinate of root edge is calculated. 
The deviation from the weaving center to that of the root edges is found. The surface 
of the groove is calculated by using estimation result in Fig.5 (a) and is shown in 
Fig. 6. The dot corresponds to the estimated surface and is close to the groove surface. 
The coordinate of the root edges is found. The deviation D is calculated and is almost 
equal to the value found by high speed video image. 
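
The tangent point equations (1) and (2) for the forward process can be written as a small helper. This is an illustrative sketch: the function name, the `side` argument and the sample values are hypothetical, and lengths are assumed to be in millimetres.

```python
import math

def tangent_point(tp, L_k, l_k, side):
    """Point on the groove surface reached by the arc, Eqs. (1) and (2).

    tp:   torch position along x
    L_k:  estimated wire extension length
    l_k:  estimated arc length
    side: "left" or "right" wall of the 60-degree groove
    """
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    sign = -1.0 if side == "left" else 1.0
    ang = math.radians(30.0)
    x = tp + sign * l_k * math.cos(ang)
    y = L_k + l_k * math.sin(ang)
    return x, y

left = tangent_point(0.0, 15.0, 2.0, "left")
right = tangent_point(0.0, 15.0, 2.0, "right")
```

Collecting these points over one weaving cycle gives the estimated groove surface, from which the root edges and the deviation D are obtained.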

During the backward process, the surface of the weld pool is found. First, the
coordinate of the top of the electrode wire is calculated by using the extension and the
torch position. Secondly, taking the top of the wire as the center, a circle is drawn with
the radius corresponding to the arc length $l_k$. Thirdly, the envelope of the circles is
drawn; it corresponds to the surface of the weld pool. By using the estimation results, the
surface of the weld pool is calculated and is shown in Fig. 7. The bead height can be found
from the coordinates of the root edges and the weld pool surface.



5 Robotic welding system for back bead control 




Fig.6. Estimation result of groove. 




Fig.7. Estimation result of weld pool surface 
when gap is 4mm. 

It is important to give the pulsed droplet to each root edge of the base metals alternately
in one cycle of torch weaving. For this purpose, the center of torch weaving should be
adjusted to the center of the groove, and the arc length should also be controlled. The
control block diagram is shown in Fig. 8. The torch position and the arc length are
controlled by giving the pulse number to the manipulator and by changing the
voltage of the power source, respectively.

Fig. 8. Control system using neural network.

6 Conclusions

The authors proposed the new welding method to get the back bead regardless of the 
variation of the groove gap. The welding torch is moved backward and forward like 
the switch back accompanied by torch weaving. During the forward process, the 
droplet metal is deposited by pulsed current to each root edge of both base metals in 
the case of relatively wide root gap. During the backward process, the base metals 
bridge and the back weld pool tends to become big enough to get a good back bead. 

The amplitude of the torch weaving is controlled according to the gap. The arc 
length and the wire extension length are estimated by using the neural network sensor, 
which takes the current and the voltage as the input. By using the estimated value and 
the geometric method, the gap is calculated. The pulsed transfer metal has been de- 
posited to the root edges of the base metals by the pulsed current synchronized with 
the torch motion. The stable and wide back bead has been obtained. 

Abstract. This paper deals with problems concerning traffic signal control. It
is important to reduce the waiting time of each car at the intersection in order to
avoid traffic congestion. The authors apply multi-layered fuzzy inference to
traffic signal control. First, the number of cars is detected by using sensors set on
the road. The elapsed time is measured after the change of the signal from blue (go)
to red (stop). Secondly, the degree to change the traffic signal is inferred by using
multi-layered fuzzy inference, of which the inputs are the elapsed time and the
number of cars. The performance of a fuzzy controller depends on the fuzzy
variables and the control rules. In this paper, the control rules are constructed
from expert's knowledge. Numerical simulations are carried out. A good
performance is obtained by using multi-layered fuzzy control.



1 Introduction 

It is important to relieve traffic congestion in urban traffic. The degree of traffic
congestion depends on the performance of the signal control and on urban planning.
Moreover, the traffic volume changes with time and date. If a method in which the cycle
of changing the signal is fixed is applied to the signal control, a good performance is
obtained as long as there is no variation of the traffic volume. If there is an
unexpected variation of the traffic volume, the waiting time of the cars becomes long
and there is traffic congestion. In order to improve the traffic situation, the
traffic volume is detected and the traffic signal is controlled according to it.

For this purpose, the authors apply fuzzy control using expert's knowledge to
traffic signal control. The performance of the fuzzy inference depends on the fuzzy
variables and the inference rules. When conventional fuzzy inference is used,
many rules are required. The authors propose multi-layered fuzzy inference to
easily represent the skills and the experience of the expert. The tuning of the fuzzy
variables is performed by using the steepest descent method. In order to evaluate the
control performance, the performance function J, defined as the sum of the waiting
times, is introduced. The optimum fuzzy variables are calculated by the steepest descent
method so as to minimize the performance function J. The validity of the proposed
control system is verified by carrying out numerical simulation.



2 Modeling of Intersection 



The intersection is shown in Fig. 1. There are two roads, one from east to west and one
from north to south. Each road is composed of 3 lanes. In Japan, cars drive on the left.
The traffic signal of each lane has 5 kinds of sign as follows: red, blue, yellow, blue
for right turn, red for right turn. Two detectors are set on each lane to count the cars.
The distance between the detectors is 150 m. In the simulation, cars enter the
intersection according to random numbers. In the intersection, there is no passing.

Fig. 1. Intersection.

3 Multi-layered fuzzy control for traffic signal control

The controller is shown in Fig. 2. In the fuzzy inference, the degree to change the 
signal is inferred. In the decision section, if the degree is over a threshold, the signal is 
changed from red to blue. 







In order to make the control rules, the laws for changing the signal are considered as
follows:

1. If there is no car on the lane of the blue sign, there are a few cars on the lane of the
red sign and the elapsed time from the change to the red sign is not short, then the
signal is changed.

2. If cars on the lane of the blue sign are few and the elapsed time is short, then the
signal is held, because this case corresponds to the moment when the cars begin
to move.

3. If the elapsed time is long and the number of cars on the lane of the red sign is
greater than that on the lane of the blue sign, then the signal is changed.

Considering the above, the following input variables are selected.

1. T : Elapsed time from the sign change to red.

2. G : Number of cars on the lane of the blue sign.

3. W_1 : Number of cars on the lane of the next blue sign, that is, the lane whose
signal becomes blue after the signal of the currently blue lane changes.

4. W_2 : Number of cars waiting for the change of the signal, except for the lanes of
the blue sign and the next blue sign.

The relationship between the degree C to change the signal and T, G, W_1 and W_2
is represented by using the linguistic rules shown in Fig. 3. The fuzzy rule base is
composed of 4 layers. The authors represent T, G, W_1 and W_2 by using three kinds of
linguistic representation (membership functions of Large, Medium, Small) illustrated in
Fig. 4. In the fourth layer of Fig. 3, the relationship between W_1 and C is described. For
example, in the case where W_2 is small, the rules are as follows:

1. If W_1 is Small, then C is Small.

2. If W_1 is Medium, then C is Medium.

3. If W_1 is Large, then C is Large.
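
The three rules of the fourth layer can be sketched as a small fuzzy inference step. The sketch is only illustrative and rests on assumptions not given in the text: the input is normalized to [0, 1], the Small, Medium and Large membership functions are triangular, the rule consequents are represented by single values, a weighted average is used to combine them, and the threshold value is arbitrary. All names are hypothetical.

```python
def memberships(w):
    """Assumed triangular Small / Medium / Large memberships on [0, 1]."""
    w = min(max(w, 0.0), 1.0)
    small = max(0.0, 1.0 - 2.0 * w)
    medium = max(0.0, 1.0 - abs(2.0 * w - 1.0))
    large = max(0.0, 2.0 * w - 1.0)
    return small, medium, large

def infer_degree(w1, c_small=0.0, c_medium=0.5, c_large=1.0):
    """Degree C from the rules: W_1 Small -> C Small, Medium -> Medium, Large -> Large."""
    s, m, l = memberships(w1)
    return (s * c_small + m * c_medium + l * c_large) / (s + m + l)

def should_change(degree, threshold=0.6):
    """Decision section: change the signal when the degree exceeds a threshold."""
    return degree > threshold

c_low = infer_degree(0.25)
c_high = infer_degree(0.9)
```

In the multi-layered scheme the consequent values used here would themselves depend on the results of the other layers, and their parameters would be tuned rather than fixed by hand.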








Fig. 2. Fuzzy membership function. 

If Wj is given, the fitness Wj^, and for small, medium, and large in C is
calculated from the membership functions in Fig.4. Then-part in case where W 2 is 
small is the inference result of the fourth layer given by 



^4s - ( 1 ) 

First of all, the inference result of the fourth layer is calculated. Finally, the 
inference is obtained. 



4 Tuning up fuzzy variables 

Since the performance of the estimator depends on the fuzzy variables, their parameters 
are adjusted to obtain good performance. The optimum parameters are found by the 
steepest descent method so as to minimize the performance function $J$, which is the 
sum of the waiting times $t_i$ of the cars 

$$J = \sum_{i=1}^{N} t_i \qquad (2)$$

where $N$ is the total number of cars. 

The parameter $w_{11}$ is updated by 

$$w_{11} = w_{11} - \alpha \frac{\partial J}{\partial w_{11}} \qquad (3)$$

where $\alpha$ is a small positive number and corresponds to the training rate. 

The partial derivatives are calculated by using the membership functions 
illustrated in Fig.4. Other parameters are calculated in the same manner as in the above 
equations. In the tuning, the parameters of the fuzzy variables are adjusted from the first 
layer to the fourth layer, as in the back propagation method. 
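
The update rule of the steepest descent method can be sketched in Python as follows. The quadratic function used here is only a toy stand-in for the performance function (the real one depends on the traffic simulation), and the names `steepest_descent`, `grad_j`, `alpha` and `steps` are illustrative assumptions.

```python
def steepest_descent(grad, w0, alpha=0.1, steps=100):
    """Repeat the update w <- w - alpha * dJ/dw for every parameter."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - alpha * gi for wi, gi in zip(w, g)]
    return w


def grad_j(w):
    """Gradient of the toy function J(w) = (w[0] - 3)^2 + (w[1] + 1)^2."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


# Starting from the origin, the parameters approach the minimiser (3, -1).
w_opt = steepest_descent(grad_j, [0.0, 0.0])
```

A small training rate makes each step cautious; for this toy function every step shrinks the distance to the minimiser by a constant factor.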



5 Simulation of traffic 

The sequence of signal changes is as follows: red, blue, yellow, blue for right turn, 
yellow for right turn, and red. The holding time of yellow is 4 s. First, the simulation 
is carried out in the case where the traffic volume is 400 cars/h for both the east-west 
and north-south directions. Two methods of signal control are applied in the simulation. 
Secondly, after 1 hour, the traffic volume changes from 400 cars/h to 800 cars/h for the 
north-south direction only. The simulation result is shown in Fig. 5. After the change of 
the traffic volume, the waiting time increases in the case where the method of fixed timing is 

Fig. 3. Simulation result: waiting time of cars versus time [h], for cars going from 
east to west and from north to south, with the signal controlled by fuzzy inference 
and with the signal changed by a fixed cycle. 

applied. Moreover, after the traffic volume returns to the normal situation (400 
cars/h), the waiting time does not recover. On the other hand, in the case where the 
traffic signal is controlled by the fuzzy controller, the waiting time is kept almost 
constant regardless of the variation of the traffic volume. Good performance is obtained 
by applying the multi-layered fuzzy controller. 



6 Conclusions 

This paper dealt with the problem of traffic control. A new method based on 
multi-layered fuzzy inference was proposed. The fuzzy variables and the control 
rules play an important role in the performance of the controller. The validity of the 
proposed method was verified by simulation. 
Abstract. Approximation region-based decision tables are tabular specifications 
of three decision rules, uncertain in general, corresponding to 
rough approximation regions: positive, boundary and negative regions. 
The focus of the paper is on the extraction of such decision tables from 
data, their relationship to conjunctive rules and probabilistic assessment 
of decision confidence. The theoretical framework of the paper is a variable 
precision model of rough sets. 



1 Introduction 

Rough sets theory has been an active area of research for almost twenty years 
since the first pioneering works on the topic were published by Pawlak [3]. This 
novel set representation model, capturing the relationship between the ability to 
discern observations, and the ability to define a subset of observations, turned 
out to be both fundamental and general enough to stimulate significant develop- 
ments in logic, machine learning, data mining, control theory and a number of 
application areas (see, for example [2,5,7,10]). Despite its relative popularity and 
the interest in this research paradigm, some fundamental questions still remain 
to be answered. One of these questions concerns the major difference between 
rough set-based approaches to machine learning and data mining versus similar 
methodologies developed outside the rough sets community (e.g. [13]). Other 
questions relate to whether we are utilizing the capabilities of the rough sets ap- 
proach properly in the context of machine learning and data mining applications, 
and whether we are taking full advantage of the power of the methodology. 

These questions are motivated by the fact that rough set-based approaches 
to machine learning and data mining are primarily concerned with acquisition 
(so-called “induction”) of conjunctive rules from data, contributing new, or 
improved rule extraction algorithms to already existing ones (e.g. [4,8]). In this 
sense, the results of rough sets research cannot be distinguished from results 
of other fields and, consequently, lead to questions regarding the existence of 
essentially new unique contribution of this methodology in comparison to other 
approaches. 

It has been my opinion for some time, however, that the rough sets methodology 
has a unique “product of its own” to offer, a “product” which has numerous 
advantages over conjunctive rules and, despite this, seems to have been 
neglected in much of recent research and applications dominated by rule computation 
techniques. This “product” is a whole methodology of decision table 
analysis and acquisition from observation data. 

Therefore, the main goal of this paper is a presentation of the methodology 
of extended decision tables extracted from data, referred to as approximation 
region-based decision tables. The approximation region-based decision tables en- 
capsulate uncertain, in general, decision rules extracted from data. The decision 
rules correspond to approximation regions of the target set. In the following sec- 
tions, the capabilities of approximation region-based decision tables as tools for 
predictive decision making are investigated and compared to conjunctive rules 
in the context of the variable precision model (VPRS) of rough sets [9,12]. The 
VPRS model has been chosen here rather than the original model of rough sets 
since in most of the practical problems occurring in machine learning, pattern 
recognition and data mining the inter-data relationships are probabilistic in na- 
ture, leading to non-deterministic decision tables. 

2 Approximation Region Rules 

Decision tables are fundamental to decision making, and hardware/software de- 
sign methodologies [14]. In their original formulation they represent functional 
relationship between sets of observations and decisions. The relationship is expressed 
in the form of a collection of conjunctive deterministic rules: 

$$(condition_1) \wedge (condition_2) \wedge \ldots \wedge (condition_m) \rightarrow (decision)$$

In this approach the decision tables are designed manually, reflecting pro- 
cessing requirements of the designer. However, in problems related to machine 
learning, pattern recognition and data mining, deterministic decision tables can- 
not be constructed due to lack of proper decision knowledge or the inherent non- 
determinism (i.e. partial functionality, or lack of functionality) of the relationship 
between conditions (observations) and decisions (recognitions, predictions, etc.). 
Consequently, in many problems there is a need to use observation data, rather 
than the designer’s knowledge, to derive the decision table (non-deterministic in 
general) . 

In the rough sets theory, the decision table is a central notion. Specifically, 
the data-extracted decision table represents partial functional dependency [1,2] 
occurring in the data. For the purpose of this article, it can be defined as follows: 
Given the subset $X \subseteq U$ of the universe $U$, the decision table with respect 
to deciding (predicting) whether an object $e \in U$ also belongs to the set $X$ 
corresponds to the rough characterization [2] of the set $X$ in the approximation 
space $(U, R)$, where $R$ is the equivalence relation (indiscernibility relation) on $U$, 
typically defined in terms of combinations of values of attributes $a \in A$ (we 
assume that the set of attributes $A$ is finite and each attribute has a finite domain). 
In other words, the decision table for the target set $X$ (the concept), in the 
simplest case, is a set of disjoint conjunctive rules, referred to as elementary 
rules, conforming to one of the following three possible formats: 

- $r^{+}_{X,E} : (a_1, v_{a_1}) \wedge (a_2, v_{a_2}) \wedge \ldots \wedge (a_m, v_{a_m}) \rightarrow +$ with $supp(r^{+}_{X,E}) \subseteq POS(X)$ 







W. Ziarko 



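- $r^{?}_{X,E} : (a_1, v_{a_1}) \wedge (a_2, v_{a_2}) \wedge \ldots \wedge (a_m, v_{a_m}) \rightarrow\ ?$ with $supp(r^{?}_{X,E}) \subseteq BND(X)$ 

- $r^{-}_{X,E} : (a_1, v_{a_1}) \wedge (a_2, v_{a_2}) \wedge \ldots \wedge (a_m, v_{a_m}) \rightarrow -$ with $supp(r^{-}_{X,E}) \subseteq NEG(X)$ 

where $E$ is an elementary set of the relation $R$, $POS(X)$, $BND(X)$ and $NEG(X)$ 
are respectively the positive, boundary and negative regions of the set $X$ [1] and 
$supp(r_{X,E})$ is a rule support set, that is, a set of objects belonging to $U$ and 
matching the conditions of the rule. 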

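The above set of rules can be re-expressed in the form of at most three 
rules in disjunctive normal form (DNF), each of which corresponds to exactly 
one rough approximation region $POS(X)$, $BND(X)$ and $NEG(X)$: 

$r^{+}_{X} : \bigvee ((a_1, v_{a_1}) \wedge (a_2, v_{a_2}) \wedge \ldots \wedge (a_m, v_{a_m})) \rightarrow +$, for positive region 

$r^{?}_{X} : \bigvee ((a_1, v_{a_1}) \wedge (a_2, v_{a_2}) \wedge \ldots \wedge (a_m, v_{a_m})) \rightarrow\ ?$, for boundary region 

$r^{-}_{X} : \bigvee ((a_1, v_{a_1}) \wedge (a_2, v_{a_2}) \wedge \ldots \wedge (a_m, v_{a_m})) \rightarrow -$, for negative region. 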



Thus, when deriving decision tables from data in the rough sets approach 
we are essentially facing the problem of acquisition of a maximum of three rules 
in the very uniform DNF format. We will refer to these rules as approximation 
region rules, or simply as the positive rule, negative rule or boundary rule. 
The decision table with approximation region rules will be referred to as an 
approximation region-based decision table. 
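
As an illustration of how the three approximation regions could be obtained from data, the following Python sketch groups objects into elementary sets by their attribute values and assigns each elementary set to a region. The table layout (a dictionary from object identifiers to attribute-value dictionaries), the function names and the sample data are assumptions made for this example only.

```python
from collections import defaultdict


def elementary_sets(table, attributes):
    """Group objects that share the same values on the given attributes."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj)
    return dict(classes)


def approximation_regions(table, attributes, target):
    """Assign every elementary set to POS, BND or NEG of the target set."""
    regions = {"POS": set(), "BND": set(), "NEG": set()}
    for members in elementary_sets(table, attributes).values():
        if members <= target:              # fully inside the target set
            regions["POS"] |= members
        elif members.isdisjoint(target):   # fully outside the target set
            regions["NEG"] |= members
        else:                              # partly inside, partly outside
            regions["BND"] |= members
    return regions


# Hypothetical sample data.
TABLE = {
    1: {"age": "young", "skill": "yes"},
    2: {"age": "young", "skill": "yes"},
    3: {"age": "old", "skill": "no"},
    4: {"age": "old", "skill": "no"},
    5: {"age": "medium", "skill": "yes"},
    6: {"age": "medium", "skill": "yes"},
}
TARGET = {1, 2, 5}
REGIONS = approximation_regions(TABLE, ["age", "skill"], TARGET)
```

Each key of `elementary_sets` is the condition part of one elementary rule; the disjunction of the keys that fall into one region gives the condition part of the corresponding approximation region rule.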



3 Probabilistic Decision Tables 

In most practical problems involving processing large amounts of data, all equivalence 
classes of the relation $R$ and the relationship between the classes and the 
target set (concept) $X$ are unknown. Usually, the data in our possession is a 
proper subset, a sample of the universe U. We will assume that this is a uni- 
formly distributed random sample, although in practice this is normally not the 
case. Also, in typical problems the space U is infinite comprising all past and 
future occurrences of some observations (called elementary events in probability 
theory). Our objective, in case of probabilistic approximation region-based deci- 
sion tables (probabilistic decision tables in short), will be to use the data sample 
to identify, with sufficient confidence, the following: 

— All or almost all equivalence classes of the relation R by their discriminating 
descriptions; 

— The discriminating descriptions of the generalized VPRS approximation regions 
of the target set $X$, that is of $POS_u(X)$, $BND_{l,u}(X)$ and $NEG_l(X)$, 
where $0 < l < u < 1$ are the lower and upper limit parameters, respectively, of the 
VPRS model as defined below. 

The main ideas of VPRS model [12] approximation regions are summarized 
as follows: 
Let $R^*$ be the set of equivalence classes of the indiscernibility relation $R$ and 
let $E \in R^*$ be an equivalence class (elementary set) of the relation $R$. With each 
class $E$ we can associate the estimate of the conditional probability $P(X|E)$ by 
the formula: $P(X|E) = card(X \cap E) / card(E)$ if sets $X$ and $E$ are finite. 

For any subset $X \subseteq U$ we define the $u$-positive region of $X$, $POS_u(X)$, as a 
union of those elementary sets whose conditional probability $P(X|E)$ is higher 
than the upper limit $u$, that is 

$$POS_u(X) = \bigcup \{E \in R^* : P(X|E) > u\}$$

The $(l, u)$-boundary region $BND_{l,u}(X)$ of the set $X$ with respect to the lower 
and upper limits $l$ and $u$ is a union of those elementary sets $E$ for which the 
conditional probability $P(X|E)$ is higher than the lower limit $l$ and lower than 
the upper limit $u$. Formally, 

$$BND_{l,u}(X) = \bigcup \{E \in R^* : l < P(X|E) < u\}$$

In practical decision situations, the boundary area represents objects which 
cannot be classified with sufficiently high confidence (represented by $u$) into 
the set $X$ and which also cannot be excluded from $X$ with sufficiently high 
confidence (represented by $1 - l$). 

The $l$-negative region $NEG_l(X)$ of the subset $X$ is a collection of objects 
which can be excluded from $X$ with confidence higher than $1 - l$, that is, 

$$NEG_l(X) = \bigcup \{E \in R^* : P(U - X|E) > 1 - l\}$$
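
The region definitions above can be sketched in Python as follows. The sketch estimates $P(X|E)$ as a frequency within each elementary set and compares it with the limits using strict inequalities, as written in the definitions; whether the limits themselves should be included is left as an assumption the reader may need to adjust. The table layout, function name and sample data are hypothetical.

```python
from collections import defaultdict


def vprs_regions(table, attributes, target, lower, upper):
    """Map each elementary set (by its description) to a VPRS region,
    together with the estimate of P(X|E)."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj)

    regions = {"POS": {}, "BND": {}, "NEG": {}}
    for description, members in classes.items():
        p = len(members & target) / len(members)  # card(X n E) / card(E)
        if p > upper:
            regions["POS"][description] = p
        elif p < lower:                # same as P(U - X | E) > 1 - lower
            regions["NEG"][description] = p
        elif lower < p < upper:
            regions["BND"][description] = p
    return regions


# Hypothetical sample: objects 1-4 are young, 5-6 medium, 7-10 old.
SAMPLE = {i: {"age": "young"} for i in range(1, 5)}
SAMPLE.update({i: {"age": "medium"} for i in range(5, 7)})
SAMPLE.update({i: {"age": "old"} for i in range(7, 11)})
CONCEPT = {1, 2, 3, 5, 7}
VPRS = vprs_regions(SAMPLE, ["age"], CONCEPT, lower=0.3, upper=0.7)
```

In this sample the "young" class has an estimated probability of 0.75 and lands in the positive region, the "medium" class (0.5) is inconclusive, and the "old" class (0.25) lands in the negative region.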

Based on the above definitions of approximation regions, the approximation 
region rules defined in the context of the original model of rough sets in the 
previous section can now be generalized to non-deterministic approximation re- 
gion rules with attached estimates of conditional probabilities of their respective 
consequents (confidence factors): 

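— $r^{+}_{X} : \bigvee ((a_1, v_{a_1}) \wedge (a_2, v_{a_2}) \wedge \ldots \wedge (a_m, v_{a_m})) \rightarrow +$ with conditional probability 
$P(+ \mid (a_1, v_{a_1}) \wedge (a_2, v_{a_2}) \wedge \ldots \wedge (a_m, v_{a_m})) > u$ for positive region; 

— $r^{?}_{X} : \bigvee ((a_1, v_{a_1}) \wedge (a_2, v_{a_2}) \wedge \ldots \wedge (a_m, v_{a_m})) \rightarrow\ ?$ with “inconclusive” conditional 
probability $l < P(+ \mid (a_1, v_{a_1}) \wedge (a_2, v_{a_2}) \wedge \ldots \wedge (a_m, v_{a_m})) < u$ for boundary 
region; the boundary area rule is relatively inconclusive in the sense that the 
probability of positive outcome is not high enough (i.e. at least $u$) and not 
low enough (i.e. at most $l$). 

— $r^{-}_{X} : \bigvee ((a_1, v_{a_1}) \wedge (a_2, v_{a_2}) \wedge \ldots \wedge (a_m, v_{a_m})) \rightarrow -$ with conditional probability 
$P(- \mid (a_1, v_{a_1}) \wedge (a_2, v_{a_2}) \wedge \ldots \wedge (a_m, v_{a_m})) > 1 - l$ for negative region. 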

These three rules can be summarized in the form of a probabilistic approxima- 
tion region-based decision table as illustrated in Table 1 . 
REGION  AGE     SEX     SKILL  INCOME  P(INCOME | conditions) 
POS     young   female  yes    +       0.85 
        young   male    no 
BND     medium  male    yes    ?       0.6 
        young   female  no 
NEG     old     male    no     -       0.9 
        old     female  yes 
        medium  male    no 

Table 1. Example probabilistic decision table with $l = 0.1$ and $u = 0.85$ 

4 Estimating Probability Distributions 

The problem of estimating probability distributions based on a random sam- 
ple has been investigated in probability theory. In brief, the main theorems of 
probability theory, known as laws of large numbers, assert that if the sample is 
collected at random and without bias, and if its size is sufficiently large, then the 
frequency distribution-based estimator of the probability of an event will con- 
verge to the actual probability with growth of the sample size. Consequently, if 
the sample size is greater than the assumed threshold level n then the computed 
estimates are very likely to be close approximations of the actual probabilities. 
The choice of the threshold $n$ depends on the required confidence level and the 
actual value of the estimated probability (which is unknown), meaning that in 
practice a somewhat arbitrary value of $n$ usually has to be selected. The implications 
of these well-known facts for the process of decision table construction 
from data are summarized below. 

Given the approximation space $(U, R)$, let $U' \subseteq U$ be the random sample and 
let $n$ be the threshold sample size. Also, let $X'$, $R'$ be the restrictions of the 
target concept $X$ and of the relation $R$ to the sample, that is $X' = X \cap U'$ and 
$R' = R \cap (U' \times U')$. To construct an approximation region-based decision table 
with credible assessments of probability distributions the following requirements 
need to be satisfied: 

— $\{[y]_R : y \in U'\} = \{[y]_R : y \in U\}$, that is, all equivalence classes of the 
relation $R$ must be represented among the classes of the relation $R'$. 

— $card(POS_u(X')) > n$, $card(BND_{l,u}(X')) > n$, $card(NEG_l(X')) > n$, that is, 
credible confidence factors are to be associated with the computed approximation 
region rules. 

The satisfaction of at least one condition in the second requirement guarantees 
that $card(U') > n$, which means that a credible estimate of the probability 
$P(X)$ of the target concept $X \subseteq U$ can also be calculated by taking the ratio 
$card(X') / card(U')$. The probabilities $P(X)$, $1 - P(X)$ can be associated with two 
“empty antecedent” rules: 

— $r^{+}_{\emptyset} : \emptyset \rightarrow +$ with the probability $P(+) = P(X)$ 

— $r^{-}_{\emptyset} : \emptyset \rightarrow -$ with the probability $P(-) = P(U - X)$. 

The above two rules reflect the situation of extreme lack of knowledge, when 
there is no other information about objects $e \in U'$, with the exception of the 
frequency distribution of positive and negative cases. Therefore, these two rules are 
not very useful if the set of attributes is non-empty and if approximation region 
rules of sufficiently high confidence can be computed. To take full advantage of 
this extra knowledge (i.e. attributes and their values) the parameters lower limit 
$l$ and upper limit $u$ for the approximation regions should be set in such a way as 
to provide for some gain of the predictive capability over the “empty antecedent 
rules”, that is, it is required that: (1) $u > P(+)$ and (2) $l < P(+)$. 

The requirement (1) guarantees that the positive region rule will predict + 
with higher confidence and smaller error than random choice and the requirement 
(2) guarantees that the negative region rule will predict - with higher confidence 
and smaller error than random selection. 
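
A minimal Python sketch of the two requirements follows. It estimates the prior probability of the target concept from sample counts and checks that the chosen limits lie on either side of it. The function names and the sample counts are hypothetical.

```python
def prior_probability(sample_size, positives):
    """Estimate P(X) as card(X') / card(U') from sample counts."""
    if sample_size <= 0:
        raise ValueError("the sample must not be empty")
    return positives / sample_size


def limits_give_gain(prior, lower, upper):
    """Requirement (1): upper > prior; requirement (2): lower < prior."""
    return upper > prior and lower < prior


# Hypothetical sample: 60 positive cases among 200 objects.
PRIOR = prior_probability(200, 60)
```

With this sample the prior is 0.3, so limits such as 0.1 and 0.85 satisfy both requirements, whereas an upper limit of 0.25 would not improve on the "empty antecedent" rule for positive cases.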

5 Conjunctive Rules vs. Approximation Region Rules 

The number of data cases matching the rule condition part is referred to in data 
mining literature as the strength of the rule (alternative naming conventions used 
are scope of the rule or rule coverage). For acceptable assessments of the con- 
ditional probabilities (confidence factors) associated with the rules the strength 
should be higher than the threshold n. In practical applications, however, par- 
ticularly when the number of attributes is large, the elementary rules tend to 
be weak, often supported by just a few cases. This means that their confidence 
factors are likely to be unreliable. On the other hand, even with relatively many 
weak elementary rules contained in an approximation region rule, the support 
level for the approximation region rule may exceed the threshold n resulting in a 
credible estimate of the rule’s conditional probability. In fact, all that is required 
is that the sum of support levels over all elementary rules contained in a given 
approximation region rule be greater than the threshold $n$, that is, for the rule 
$r_X = \bigvee r_{X,E}$ it is required that $\sum supp(r_{X,E}) > n$. 

The above condition indicates that even in the absence of sufficient support 
for the elementary rules, the approximation region rules would still be acceptably 
supported and, consequently, could be used for predictive decision making. This 
leads to the conclusion that elementary rules should not be used individually for 
decision making but in sets corresponding to sufficiently strong approximation 
region rules. 
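
The argument can be illustrated with a small Python sketch in which every elementary rule is too weak on its own while the approximation region rule built from them is sufficiently supported. The support counts and the threshold are hypothetical values chosen for the example.

```python
def region_rule_support(elementary_supports):
    """Support of a region rule: the elementary rules have disjoint
    support sets, so their support levels simply add up."""
    return sum(elementary_supports)


def credible(support, threshold):
    """A rule is acceptably supported when its support exceeds n."""
    return support > threshold


# Hypothetical support levels of four elementary rules and threshold n.
ELEMENTARY_SUPPORTS = [3, 4, 2, 5]
THRESHOLD_N = 10
REGION_SUPPORT = region_rule_support(ELEMENTARY_SUPPORTS)
```

Here no elementary rule reaches the threshold of 10 cases, but the region rule is supported by 14 cases, so only its confidence factor would be treated as credible.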

Most of the interest in rule “induction” from data using rough sets methodology 
is focused on computation of minimal length conjunctive rules corresponding 
to prime implicants of the associated discernibility function [6]. Outside the 
rough sets approach, similar rules are computed by other methods [14]. The 
common feature of all these approaches is that they attempt to minimize the 
number of conditions in the antecedent parts of the rules. We will refer to such 
rules as minimal rules. An important advantage of minimal rules over other 
kinds of rules, such as those corresponding to root-leaf paths in decision trees or 
elementary rules, is the relatively high strength of the minimal rules. In what 
follows we compare the minimal rules to the approximation region-based rules. 
The comparison is conducted with respect to such criteria as completeness of 
the rule set, its incrementality and support level. To ensure fairness of the 
comparison we will assume that we are comparing an approximation region rule 
versus a set of minimal rules whose support sets cover the same approximation 
region computed from the sample $U' \subseteq U$. 

— Completeness: Since the relationships between minimal rules, in terms of 
the overlaps between their support sets, are not known, it is impossible to 
determine how complete the set of minimal rules is. On the other hand, 
the elementary rules have disjoint support sets and their number is combi- 
natorially bounded which means that definite completeness criteria can be 
established. 

— Incrementality: A new minimal rule added to the existing collection of 
rules may have its support set covered by support sets of existing rules which 
essentially means that the new rule is redundant. An extra elementary rule 
for an approximation region will always contribute, in terms of covering 
previously uncovered cases, to the set of existing elementary rules. This 
means that the approximation region rule would grow incrementally as new 
sample cases are added to the database. 

— Support: Support level of an approximation region rule is typically higher, 
and never lower than support of a minimal rule for the same approximation 
region, for obvious reasons. On the other hand, one could argue that a dis- 
junction of all minimal rules would have the support level as good as the 
corresponding approximation region rule. Such a rule, however, would be 
ill-defined in the context of our requirements, as there is no guarantee that 
its confidence factor would not be lower than the threshold level even if all 
component minimal rules would satisfy that requirement. The other aspect 
is that it would not be possible to determine the support level of such a rule 
in terms of support levels of the component minimal rules leading to serious 
implementation difficulties when dealing with large databases. 



6 Conclusions 

The article introduced approximation region-based decision tables, in both 
probabilistic and non-probabilistic settings. It appears that decision tables of this kind 
are a data-extracted knowledge representation technique superior to conjunctive 
rules, as argued in the last section of the paper. The necessary software 
for generation of such tables is currently being incorporated into the KDD-R 
system for data mining [11]. Significant potential for applications of this technique 
also seems to exist in pattern recognition, machine learning and trainable 
control. 
Abstract. Today’s Data Base Management Systems do not provide 
functionality to extract potentially hidden knowledge in data. This problem 
gave rise in the 1980s to a new research area called Knowledge Discovery 
in Data Bases (KDD). In spite of the great amount of research that has 
been done in the past 10 years, there is no uniform mathematical model 
to describe various techniques of KDD. The main goal of this paper is to 
describe such a model. The Model integrates in a uniform framework 
various Rough Sets Techniques with standard, non Rough Sets based 
techniques of KDD. 

The Model has been already partially implemented in RSDM (Rough Set 
Data Miner) and we plan to complete the implementation by integrating 
all the operations in the code of database management systems. 

Operations that are defined in the paper have successfully been imple- 
mented as part of RSDM. 

1 Introduction 

The KDD process was first defined as the “non-trivial process of extracting valid, 
potentially useful and ultimately understandable patterns in data” (see [8]). Since the 
appearance of the KDD term, a lot of research has been carried out on the 
efficient implementation of algorithms to extract knowledge from data [12,10,5]. 
Some of the algorithms that have been studied come from different research fields 
such as statistics, machine learning and artificial intelligence. They have just been 
adapted to tackle very large databases. 

Some implementations deal with the integration of the algorithms with Rela- 
tional Databases (see [4], [9], [2], [13]). In such systems, the algorithms are run 
over the relational database, making use of SQL to select and order the target 
data. For some of these algorithms a tightly coupled version has been 
programmed [1] and has been shown to improve efficiency, but further improvements 
could still be obtained if such algorithms were programmed as operations of the 
This work is supported by the Spanish Ministry of Education under project PB95- 
RDBMS because the execution of the algorithms will take advantage of the 
optimizer of the system. 

In spite of the great amount of research that has been done on KDD, there is no 
mathematical model to describe in a uniform way the whole process of discovering 
patterns, although a first approach to a model of generalization is presented 
in [3]. 

In this paper we present a first approach to such a model. Basic functions have 
been modeled and some properties of such functions have been extracted. 

The rest of the paper is as follows: in section 2 the universe of the functions as 
well as the preliminaries of the model are established. In section 3 basic functions 
are defined in mathematical terms, which allows us to extract the similarities 
among such functions. To conclude, in section 4 the future lines of research are 
outlined. 

2 DMM: Data Base Mining Model 

Database Mining, as we understand it, is an extraction of knowledge from a 
certain collection of information. We extract the knowledge by identifying similar 
information. We assume that information is given by means of attributes and their 
values. In order to describe formally the information similarity and knowledge 
extraction, we also assume that every piece of information in the collection is 
uniquely identified. We will refer to these unique identifiers as objects. 
Remark: In relational databases identifiers are represented by the key attribute 
and the information is determined by the values of the rest of the attributes in 
a certain table. 

Similarity of information is a nice concept, but difficult to implement. So we 
make a stronger assumption: we deal only with some forms of equivalent 
information. 

Given a set of attributes we can define many equivalence relations on the set 
of objects. We say that two objects in a database are equivalent with respect 
to certain attributes if they both have the same values associated with these 
attributes. 



2.1 Preliminaries 

In any database mining situation, we have to assume that we have at least 
objects and attributes. 

So, in our investigation we assume that we have a finite non-empty set $O$ of 
objects and a finite non-empty set $AT$ of attributes. We also have to assume 
that attributes have values associated with them. On the other hand, as we are 
trying to mine databases looking for information, we can assume that we deal 
with non-empty databases. 

For the sake of simplicity, we assume that objects and attributes are disjoint 
sets. 

We see elements of the universe through the information about them. So, we 
assume in our investigation that with each element $x$ of the universe $O$ ($x \in O$) 
we associate a certain information $i(x)$. Moreover, the information $i(x)$ is the 
set of all elements of the universe that are the same as $x$ with respect to the 
information $i$. 

In the case of databases, the function $i$ refers to an attribute or combination of 
attributes (descriptor). If we require every attribute to be a single-valued function 
(that is, an object can take only one value for an attribute), any $i$ induces a 
partition of $O$ into equivalence classes. So, it happens that $i(x)$ can be seen as 
the name of the equivalence class of $x$ regarding information $i$. That is why we 
can effectively define $i$ as a function that maps $O$ into $\mathcal{P}(O)$. Formally, we define 
the information function $i$ about the universe $O$ as any function: 

$$i : O \rightarrow \mathcal{P}(O)$$

such that $y \in i(x)$ means that $y$ and $x$ are the same with respect to information $i$: 

$$\forall x \in O, \quad i(x) = \{y \in O : i(x) = i(y)\} = [x]_i$$

Obviously 

$$O/i = \{[x]_i : x \in O\}$$

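Definition: We define a function $r$ in the following way: 

$$r : \mathcal{P}(AT) \times \mathcal{P}(O) \rightarrow \mathcal{P}(\mathcal{P}(O))$$

such that for all $A_X \subseteq AT$, $O = \{x_1, x_2, \ldots, x_n\}$, $O_X \subseteq O$: 

$$r(AT, O) = \{\{x_1\}, \{x_2\}, \ldots, \{x_n\}\}$$

$$r(\emptyset, O) = \{O\}$$

$$r(A_X, \emptyset) = \{\emptyset\}$$

$$r(A_X, O_X) = O_X / {\sim}A_X$$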
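
The function $r$ can be sketched in Python as a grouping of objects by their values on the chosen attributes. The table layout, the attribute names and the sample data are assumptions made for illustration, and the boundary case of an empty object set is not modelled in this sketch.

```python
def r(attributes, objects, table):
    """Partition `objects` into classes of objects that have the same
    values on all of the given attributes."""
    classes = {}
    for obj in objects:
        key = tuple(table[obj][a] for a in sorted(attributes))
        classes.setdefault(key, set()).add(obj)
    return {frozenset(members) for members in classes.values()}


# Hypothetical table: "id" plays the role of the key attribute.
DATA = {
    "x1": {"id": 1, "colour": "red", "size": "small"},
    "x2": {"id": 2, "colour": "red", "size": "large"},
    "x3": {"id": 3, "colour": "blue", "size": "small"},
}
ALL_ATTRIBUTES = {"id", "colour", "size"}
ALL_OBJECTS = set(DATA)
```

Using all attributes separates every object because the key attribute is included, using no attributes puts all objects into one class, and using only `colour` groups `x1` with `x2`.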

2.2 Data Mining Universe 

Let $O \neq \emptyset$ be a finite set of objects and let $AT \neq \emptyset$ be a finite set of attributes. 
Let $\mathcal{R}$ be the set of all equivalence relations defined on $O$ using the descriptors 
from $AT$. 

Definition : The set 
will be called the Data Mining Universe from now on. 
Observe that the universe can also be defined as: 

$$U = \mathcal{P}\left(\bigcup \{r(A_X, O) : A_X \subseteq AT\}\right)$$

Remark: Observe that the complexity of generating the universe is exponential, but 
it is important to note that it is a theoretical construction, needed only for the 
model to be established. In no case will it be necessary to generate it while running 
an algorithm. 

In what follows we will try to express rough set operations as functions over this 
universe. In order to express all the operations in a uniform way we define a 
generic function that will be called data mining operator. 

Definition: We define a Data Mining operator as any partial function of the form: 

$$g : U \times \mathcal{P}(AT) \rightarrow U \times \mathcal{P}(AT)$$

such that if $g(X, A_X) = (Y, A_Y)$ then: 

$$X \subseteq r(A_X, O), \qquad Y \subseteq r(A_Y, O), \qquad A_Y \subseteq A_X$$

Once the basic function has been defined, we will try to define the basic Data Mining 
operations in these terms. 

3 Modelization of Operations of Data Mining Using 
Rough Sets 

3.1 Projection operation 

Projection operation selects certain attributes (columns in a relational table), of 
the objects taken into account and eliminates the rest of the attributes. 

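Fact: If $A_p \subseteq A_X$, then the partition $O/{\sim}A_X$ refines the partition $O/{\sim}A_p$, 
meaning that: 

$$\forall x \in O/{\sim}A_p \ \ \exists y \in O/{\sim}A_X \ \text{such that} \ x \supseteq y$$

Having $A_p \subseteq A_X$, the family of partial functions $p_{A_p}$ is defined in the following 
way: 

$$p_{A_p} : U \times \mathcal{P}(AT) \rightarrow U \times \mathcal{P}(AT)$$

$$p_{A_p}(X, A_X) = (Y, A_p)$$

where $Y = \{[x]_{A_p}\}$, for all $X$ such that $A_p \subseteq A_X$ and $X \subseteq r(A_X, O)$. Observe 
that 

$$\forall y \in Y, \ \exists x \in X : y \supseteq x$$

where $\bigcup_{y \in Y} y = \bigcup_{x \in X} x = O$. 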


3.2 Selection operation 

This operation is used to select the subset of elements of the universe of objects 
that match a condition. The condition is a boolean expression specified on the 
attributes and their values. 

The condition of selection is expressed as a well formed formula of first order 
logic: 

$$\{x \in O : f(x) \text{ is true}\}$$

Definition: The condition $f$ is expressed as: 

$$f = \bigwedge_{i=1} (A_i \ \alpha_i \ v_i)$$

where $v_i \in Dom(A_i)$ for all $i$, and $\alpha_i$ is a comparison operator, that is, 
$\alpha_i \in \{=, \neq, >, \geq, <, \leq\}$. 

We now get two sets: 

1. $O_f = \{x \in O \text{ such that } f(x) = True\}$ 

2. $A_f = \{A \in AT \text{ belonging to the alphabet defined by } f\}$ 

Property: Given a set of attributes $A_f$ and a set of objects $O$, the set of 
attributes $A_f$ defines an equivalence relation on $O$: $O/{\sim}A_f$. 

Definition: Selection operation: Given $f$, $O_f$, $A_f$, the family of functions $S_f$ 
is defined in the following way: 

$$S_f : U \times \mathcal{P}(AT) \rightarrow U \times \mathcal{P}(AT)$$

$$S_f(X, A_X) = (\{x \in X : x \cap O_f \neq \emptyset\}, A_X)$$

for all $X$ and for all $A_f$ such that $X \subseteq r(A_X, O)$, $A_f \subseteq A_X$. 
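
A minimal Python sketch of the selection operator is given below. The condition is passed as a predicate on table rows, which is an assumption of this sketch (the paper expresses it as a formula over attributes and values); the table layout and sample data are likewise hypothetical.

```python
def select(family, attrs_x, condition, table):
    """Keep the classes of `family` that contain at least one object
    satisfying `condition`; the attribute set is left unchanged."""
    satisfying = {obj for obj, row in table.items() if condition(row)}
    kept = {cls for cls in family if cls & satisfying}
    return kept, set(attrs_x)


# Hypothetical table and its partition under {"colour", "size"}.
ITEMS = {
    1: {"colour": "red", "size": "small"},
    2: {"colour": "red", "size": "large"},
    3: {"colour": "blue", "size": "small"},
    4: {"colour": "blue", "size": "small"},
}
CLASSES = {frozenset({1}), frozenset({2}), frozenset({3, 4})}
SMALL, SMALL_ATTRS = select(
    CLASSES, {"colour", "size"}, lambda row: row["size"] == "small", ITEMS
)
```

Because the condition uses only attributes of the descriptor, every class is either entirely selected or entirely rejected.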
3.3 Lower operator 

Remark: For a further explanation of Rough Sets theory see [6,7,11,14]. 

A family of functions $l_C$ is defined, where $C$ represents the concept whose lower 
approximation is going to be defined with respect to the descriptor $A_X$: 

$$l_C : U \times \mathcal{P}(AT) \rightarrow U \times \mathcal{P}(AT)$$

$$l_C(X, A_X) = (\{x \in X : x \subseteq C\}, A_X), \qquad X \subseteq r(A_X, O)$$

Property: If $l_C(X, A_X) = (Y, A_Y)$, then $Y \subseteq X$. 

3.4 Upper operator [6,7,11,14] 

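The family of functions $u_C$ is defined as the upper approximation of objects 
belonging to $C$ with respect to the descriptor $A_X$: 

$$u_C : U \times \mathcal{P}(AT) \rightarrow U \times \mathcal{P}(AT)$$

$$u_C(X, A_X) = (\{x \in X : x \cap C \neq \emptyset\}, A_X), \qquad X \subseteq r(A_X, O)$$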
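
The lower and upper operators can be sketched in Python as filters over a family of equivalence classes. The function names and the sample family are hypothetical.

```python
def lower(family, attrs_x, concept):
    """Classes entirely contained in the concept (lower approximation)."""
    return {cls for cls in family if cls <= concept}, set(attrs_x)


def upper(family, attrs_x, concept):
    """Classes sharing at least one object with the concept
    (upper approximation)."""
    return {cls for cls in family if cls & concept}, set(attrs_x)


# Hypothetical family of classes and concept.
PARTITION = {frozenset({1, 2}), frozenset({3, 4}), frozenset({5})}
CONCEPT_C = {1, 2, 3}
LOWER_C, _ = lower(PARTITION, {"a"}, CONCEPT_C)
UPPER_C, _ = upper(PARTITION, {"a"}, CONCEPT_C)
```

Only the class `{1, 2}` lies entirely inside the concept, while `{3, 4}` merely overlaps it, so it appears in the upper result but not in the lower one.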

Let us now see how all the operations that have been studied can be expressed by 
means of these basic operators, so that we can model the process by using them. 

3.5 Key Elimination 

Let T(X, Cl,... ,Cn,D) be the relational decision table containing the target 
data. Key elimination can be expressed in terms of the operators of the model 
as: where K represents the key attribute. 

3.6 Eliminating those Equivalence Classes whose Cardinality is not
Higher than a Predetermined Value

Users will be interested in rules or descriptions of the concepts that are supported
by a sensible number of objects. So those rows corresponding to descriptions not
supported by a sufficient number of elements can be eliminated. Observe that this
operation can be expressed by means of the selection operation. Let us call the
function $f_{threshold}$. Then $f$ can be expressed as $f = counter > threshold$,
supposing that the counter attribute exists in the table. So we have:

$S_{f_{threshold}} = S_f(X, AT)$
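
As a small illustration of this selection, the sketch below filters the rows of
a table on the condition counter > threshold. It assumes that each row is a
dictionary carrying a `counter` attribute; the rows and the function name are
hypothetical.

```python
def select_by_threshold(rows, threshold, counter="counter"):
    """Keep the rows whose counter attribute is strictly above the threshold."""
    return [row for row in rows if row[counter] > threshold]


# Hypothetical table in which each row already carries its support count.
ROWS = [
    {"color": "red", "size": "small", "counter": 12},
    {"color": "blue", "size": "small", "counter": 2},
    {"color": "blue", "size": "large", "counter": 5},
]
KEPT = select_by_threshold(ROWS, threshold=4)
```

The comparison is strict, which matches the heading: classes whose cardinality
is not higher than the predetermined value are dropped.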







3.7 Reduction of Attributes 

This step deals with finding a small set of attributes to represent the data,
hence it eliminates all the superfluous information.

In terms of the basic operations that have already been described, the result
of applying reduct calculation to the table can be expressed as a projection over
the selected attributes.

3.8 Generalization 

If a concept hierarchy is available for some attributes, a generalized attribute
can be obtained in the preprocessing phase. Thus, if a particular generalization is
desired before applying the mining algorithm, all that remains is to project over
the generalized attribute. Let $A_g$ be the generalized attribute of attribute $A$
that is wanted for a particular query, and let $G_A$ denote the operation of
generalization. It can be expressed as:



$G_{A_i}(T(A_1, A_2, \ldots, A_n, A_{ig})) = \Pi_{A_1, A_2, \ldots, A_n}$



3.9 Discretization of Values of Attributes 

Discretization of attributes is a particular case of generalization in which values 
of the attribute are generalized, so it can be expressed by means of projection 
operation. 

4 Conclusions 

We have presented here a model, DMM, which identifies the stages of the Data Mining
process and describes them in terms of various database mining operations.
Some of the operations that have been presented are basic Rough Sets functions,
such as the lower and upper approximation of a concept.

These operations have been successfully defined as operations of the model. Basic
KDD operations not belonging to Rough Sets theory, such as discretization,
generalization and preprocessing operations, have also been studied and described
as part of the model. As a consequence, we can establish that the model allows
KDD operations to be defined in a uniform way.

5 Future Developments 

The model is a result of the detailed study of the implementation of the RSDM
system. The functions that have been described as part of the model have been
successfully implemented. In addition, the model has made it possible to design
a graphical interface in which the process of data analysis is displayed to the
final user.

The implementation of algorithms for obtaining association rules, which have been
shown to be describable as operations of the model, is under development.

Abstract. In this paper, we propose a new query answering system for an incomplete
Cooperative Knowledge-Based System (CKBS). CKBS is a collection of autonomous
knowledge-based systems called agents which are capable of interacting with each
other. In the first step of the query processing strategy, the contacted site of
CKBS identifies all locally incomplete attributes used in a query. An attribute is
locally incomplete if there is an object in a local information system with
incomplete information on this attribute. The values of all locally incomplete
attributes are treated as concepts to be learned at other sites of CKBS (see [6]).
Rules discovered at all these sites are sent to the site contacted by the user and
used locally by the query answering system to replace incomplete information by
values provided by the rules.

In the second step of the query processing strategy, incomplete information is
removed from the local information system in a maximal number of places. Next, the
query answering system finds the answer to a user query in the usual way (similar
to the CKBS query answering system).

Key Words: incomplete information system, cooperative 
query answering, rough sets, multi-agent system, knowl- 
edge discovery. 



1 Introduction 

By a cooperative knowledge-based system (CKBS) we mean a collection of
autonomous knowledge-based systems called agents (sites) which are capable of
interacting with each other. Each agent is represented by an information system
(either complete or incomplete) and a collection of rules called a knowledge base.
Any site of CKBS can be a source of a local or a global query. By a local query
for a site i (or i-reachable query) we mean a query entirely built from attributes
which are complete and local at site i. Local queries need only to access the
information system of the site where they were issued and they are completely
processed on the system associated with that site. In order to resolve a global
query for a site i (built from attributes not necessarily complete or local at site
i) successfully, we have to access an information system at more than one site
of CKBS and discover rules describing attributes (used in a query) which are
either not complete or not local at the site i. Rules discovered by neighbors of i
are sent to the site i and used locally by the query answering system to replace
some of the incomplete values in a local information system by values provided
by the rules. After the process of removing as many incomplete values as possible
in the information system of site i, the query answering system finds the answer
to a user query in the usual way (similarly to the CKBS query answering system).

There are a number of strategies which allow us to find rules describing decision
attributes in terms of classification attributes. We should mention here
systems such as LERS (developed by J. Grzymala-Busse), DQuest (developed by
W. Ziarko), AQ15 (developed by R. Michalski), or the rule discovery system based on
discriminant functions proposed by A. Skowron (see [8]). Most of these strategies
have been developed under the assumption that the database part of KBS is
complete. The problem of inducing rules from attributes with incomplete values was
discussed in ([2], [3], [4]). Our strategy shows how to compute such rules with
certainty factors not necessarily equal to 1 and then how to use them to make a
local information system more complete. The Chase algorithm, presented for
instance in [1], uses dependencies to make a database more complete. We use
rules learned at remote sites to achieve a similar goal.



2 Basic definitions 

In this section, we introduce the notion of an information system, distributed 
information system, a knowledge base, and s(i)-queries which can be processed 
locally at site i. 

By an information system ([5], [4]) we mean a structure $S = (X, A, V, f)$,
where $X$ is a finite set of objects, $A$ is a finite set of attributes (or properties),
$V$ is the set-theoretical union of domains of attributes from $A$, and $f$ is a
classification function which describes objects in terms of their attribute values.
We assume that:

- $V = \bigcup\{V_a : a \in A\}$ is finite,

- $V_a \cap V_b = \emptyset$ for any $a, b \in A$ such that $a \neq b$,

- $f : X \times A \to 2^V$ where $f(x, a) \in 2^{V_a} - \{\emptyset\}$ for any $x \in X$, $a \in A$.

If $f(x, a) = V_a$, then the value of the attribute $a$ for the object $x$ is unknown.
We will call system $S$ incomplete if there is $a \in A$, $x \in X$ such that
$card(f(x, a)) \geq 2$. Also, if $card(f(x, a)) \geq 2$, then the attribute $a$ is called
incomplete. Otherwise system $S$ as well as the attribute $a$ are called complete.
The set of all incomplete attributes in $S$ we denote by $In(A)$ and the set
$\bigcup\{V_a : a \in In(A)\}$ by $In(V)$. For simplicity, any complete or incomplete
information system will be called, in this paper, an information system.

Let $S_1 = (X_1, A_1, V_1, f_1)$, $S_2 = (X_2, A_2, V_2, f_2)$ be information systems.
Systems $S_1$, $S_2$ are consistent if $f_1(x, a) \subseteq f_2(x, a)$ or
$f_2(x, a) \subseteq f_1(x, a)$ for any $a \in A_1 \cap A_2$, $x \in X_1 \cap X_2$.

By a distributed information system [8] we mean a pair $DS = (\{S_i\}_{i \in I}, L)$
where:

- $S_i = (X_i, A_i, V_i, f_i)$ is an information system for any $i \in I$,

- $L$ is a symmetric, binary relation on the set $I$,

- $I$ is a set of sites.

System $DS$ is called incomplete if $(\exists i \in I)[S_i \text{ is incomplete}]$.

Systems $S_i$, $S_j$ (sites $i$, $j$) are called neighbors in $DS$ if $(i, j) \in L$. The
transitive closure of $L$ in $I$ is denoted by $L^*$.

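A distributed information system $DS = (\{S_i\}_{i \in I}, L)$ is consistent if:

$(\forall i)(\forall j)(\forall x \in X_i \cap X_j)(\forall a \in A_i \cap A_j)$

$[(x, a) \in Dom(f_i) \cap Dom(f_j) \to (f_i(x, a) \subseteq f_j(x, a) \text{ or } f_j(x, a) \subseteq f_i(x, a))]$.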

By a set of s(i)-terms we mean a least set $T_i$ such that:

- $0, 1 \in T_i$,

- $(a, w) \in T_i$ for any $a \in A_i$ and $w \in V_{ia}$,

- if $t_1, t_2 \in T_i$, then $(t_1 + t_2)$, $(t_1 * t_2)$, $\sim t_1 \in T_i$.

We say that:

- s(i)-term $t$ is atomic if it is of the form $(a, w)$ or $\sim(a, w)$ where
  $a \in B_i \subseteq A_i$ and $w \in V_{ia}$,

- s(i)-term $t$ is positive if it is of the form
  $\prod\{(a, w) : a \in B_i \subseteq A_i \text{ and } w \in V_{ia}\}$,

- s(i)-term $t$ is primitive if it is of the form $\prod\{t_j : t_j \text{ is atomic}\}$,

- s(i)-term $t$ is in disjunctive normal form (DNF) if $t = \sum\{t_j : j \in J\}$ where
  each $t_j$ is primitive.

By a query for a site i (s(i)-query) we mean any element in $T_i$ which is in
DNF.

Before we give the interpretation of s(i)-queries, we introduce the notion of
X-algebra. So, let us assume that $X$ is a set of objects. By an X-algebra
we mean a sequence $(P, \otimes, \oplus, \neg)$ where:

- $P = \{P_i : i \in J\}$ where $P_i = \{(x, p_{<x,i>}) : p_{<x,i>} \in [0, 1] \;\&\; x \in X\}$,

- $P_i \otimes P_j = \{(x, p_{<x,i>} \cdot p_{<x,j>}) : x \in X\}$,

- $P_i \oplus P_j = \{(x, \max(p_{<x,i>}, p_{<x,j>})) : x \in X\}$,

- $\neg P_i = \{(x, 1 - p_{<x,i>}) : x \in X\}$,

- $P$ is closed under the above three operations.
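
The three operations can be sketched in Python as follows, assuming that each
element of $P$ is stored as a dictionary mapping every object of $X$ to a weight
in [0, 1] and that both operands share the same objects. The names `otimes`,
`oplus` and `neg` and the sample weights are illustrative only; exact fractions
are used so that equalities can be checked without rounding.

```python
from fractions import Fraction as F


def otimes(P, Q):
    """Pointwise product of the weights of P and Q."""
    return {x: P[x] * Q[x] for x in P}


def oplus(P, Q):
    """Pointwise maximum of the weights of P and Q."""
    return {x: max(P[x], Q[x]) for x in P}


def neg(P):
    """Pointwise complement 1 - p of the weights of P."""
    return {x: 1 - P[x] for x in P}


# Hypothetical elements of P over X = {x1, x2}.
P1 = {"x1": F(1, 2), "x2": F(1)}
P2 = {"x1": F(1, 3), "x2": F(0)}
P3 = {"x1": F(3, 4), "x2": F(1, 5)}

PRODUCT = otimes(P1, P2)
MAXIMUM = oplus(P1, P2)
COMPLEMENT = neg(P1)
```

Because the weights are non-negative, multiplying by a common factor commutes
with taking the maximum, which is why the product distributes over the maximum
on this data.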

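Theorem 1. Let $P_i, P_j, P_k \in P$. Then:

- $(P_i \otimes P_j) \otimes P_k = P_i \otimes (P_j \otimes P_k)$,

- $(P_i \oplus P_j) \oplus P_k = P_i \oplus (P_j \oplus P_k)$,

- $(P_i \oplus P_j) \otimes P_k = (P_i \otimes P_k) \oplus (P_j \otimes P_k)$,

- $P_i \otimes P_j = P_j \otimes P_i$,

- $P_i \oplus P_i = P_i$.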

Let $DS = (\{S_j\}_{j \in I}, L)$ be a distributed information system where
$S_j = (X_j, A_j, V_j, f_j)$ and $V_j = \bigcup\{V_{ja} : a \in A_j\}$ for any $j \in I$. By a standard
interpretation of s(i)-queries in $DS$ we mean a partial function $M_i$ from the set
of s(i)-queries into $X_i$-algebra, defined as follows:

- $Dom(M_i) \subseteq T_i$,

- $M_i((a, w)) = \{(x, p) : x \in X_i \;\&\; w \in f_i(x, a) \;\&\; p = 1/card(f_i(x, a))\}$ for any
  $w \in V_{ia}$,

- $M_i(\sim(a, w)) = \neg M_i((a, w))$,

- for any atomic term $t_1(a) \in \{(a, w), \sim(a, w)\}$ and any primitive term
  $t = \prod\{s(b) : (s(b) = (b, w_b) \text{ or } s(b) = \sim(b, w_b)) \;\&\; (b \in B_i \subseteq A_i) \;\&\; (w_b \in V_{ib})\}$
  we have

  $M_i(t * t_1(a)) = M_i(t) \otimes M_i(t_1)$ if $a \notin B_i$,

  $M_i(t * t_1(a)) = \emptyset$ if $a \in B_i$ and $t_1(a) \neq s(a)$,

  $M_i(t * t_1(a)) = M_i(t)$ if $a \in B_i$ and $t_1(a) = s(a)$,

- for any s(i)-terms $t_1, t_2$

  $M_i(t_1 + t_2) = M_i(t_1) \oplus M_i(t_2)$.
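
The interpretation of an atomic term $(a, w)$ can be illustrated with a short
sketch. It assumes that $f_i$ is stored as a nested dictionary mapping each object
to its attributes and each attribute to a non-empty set of possible values; the
sample system and the function name are hypothetical.

```python
from fractions import Fraction


def interpret_atomic(f_i, a, w):
    """M_i((a, w)): objects whose value set for a contains w, weighted by 1/card."""
    return {
        x: Fraction(1, len(values[a]))
        for x, values in f_i.items()
        if w in values[a]
    }


# Hypothetical incomplete system: x2 has two possible colors.
F_I = {
    "x1": {"color": {"red"}},
    "x2": {"color": {"red", "blue"}},
    "x3": {"color": {"blue"}},
}
RED = interpret_atomic(F_I, "color", "red")
```

An object with a single possible value receives weight 1, while an object with
two possible values receives weight 1/2 for each of them.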

By $(k, i)$-rule in $DS = (\{S_j\}_{j \in I}, L)$, $k, i \in I$, we mean a pair $(t, c)$ such that:

- either $c \in In(V_i) \cap V_k$ or $c \in V_k - V_i$,

- $t$ is a positive s(k)-term which belongs to $T_k \cap T_i$,

- if $(x, p_1) \in M_k(t)$ then $(\exists p_2)[(x, p_2) \in M_k(c)]$.

An object $x$ satisfies a rule $r = (t, c)$ with a certainty $p$ at site $k$, if $p = p_1 \cdot p_2$,
$(x, p_1) \in M_k(t)$, and $(x, p_2) \in M_k(c)$.

We say that $(k, i)$-rule $(t, c)$ is in k-optimal form if there is no other subterm
$t_1 \in T_k \cap T_i$ of s(k)-term $t$, such that: if $x$ satisfies rule $(t, c)$ with certainty $p$,
then $x$ satisfies rule $(t_1, c)$ with the same or higher certainty.

Let $X = \{x_i : 1 \leq i \leq n\}$ and $x_i$ satisfies the rule $r = (t, c)$ with a certainty
$p_i$ at site $k$ for any $i \in \{1, 2, \ldots, n\}$. We say that $r$ has certainty $p$, if

$p = [\sum\{p_i : p_i \neq 0 \;\&\; 1 \leq i \leq n\}] / [card\{i : p_i \neq 0 \;\&\; 1 \leq i \leq n\}]$.
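
The certainty of a rule is thus the mean of the non-zero certainties with which
individual objects satisfy it. A minimal sketch follows; the function names are
hypothetical, and returning `None` when no object has a non-zero certainty is an
assumption, since the formula above leaves that case undefined.

```python
def object_certainty(p1, p2):
    """Certainty with which one object satisfies (t, c): p = p1 * p2."""
    return p1 * p2


def rule_certainty(certainties):
    """Average of the non-zero object certainties p_i of a rule."""
    nonzero = [p for p in certainties if p != 0]
    if not nonzero:
        return None  # assumption: the source formula is undefined in this case
    return sum(nonzero) / len(nonzero)


# Hypothetical certainties p_1..p_4 of four objects for one rule.
CERTAINTY = rule_certainty([0.5, 0, 1.0, 0])
```

Objects with zero certainty are excluded from both the sum and the count, so
they do not lower the certainty of the rule.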

By a knowledge base $D_{ki}$ we mean any set of $(k, i)$-rules satisfying the
condition below:

if $(t, c) \in D_{ki}$ then $(\exists t_1)[(t_1, \sim c) \in D_{ki}]$.

We say that a knowledge base $D_{ki}$ is in k-optimal form if all its rules are in
k-optimal form.

In [6] we proposed an algorithm to construct a knowledge base $D_{ki}$ in
k-optimal form. Let us assume that $L(D_{ki}) = \{(t, c) \in D_{ki} : c \in In(V_i)\}$. The
algorithm, given below, converts system $S_i$ into a new, more complete
information system $Chase(S_i)$.

Algorithm $Chase(S_i, In(A_i), L(D_{ki}))$;

Input: system $S_i = (X_i, A_i, V_i, f_i)$, set of incomplete attributes
       $In(A_i) = \{a_1, a_2, \ldots, a_k\}$ and a set of rules $L(D_{ki})$

Output: a system $Chase(S_i)$.

begin

  j := 1;

  while $j \leq k$ do

  begin

    for all $c \in V_{a_j}$ do

      while there is $x \in X_i$ and a rule $(t, c) \in L(D_{ki})$
      such that $x \in M_i(t)$ and $card(f_i(x, a_j)) \neq 1$ do

        $f_i(x, a_j) := \{c\}$;

    j := j + 1

  end;

  $Chase(S_i) := S_i$

end
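
The algorithm can be sketched in Python under a few stated assumptions: $f_i$ is a
nested dictionary from objects to attributes to sets of possible values, a
positive term $t$ is a list of (attribute, value) pairs, and an object is taken
to belong to $M_i(t)$ when every value of $t$ is among its possible values. The
function name, argument layout and sample data are hypothetical; this is an
illustration, not the authors' implementation.

```python
def chase(f_i, incomplete_attrs, values, rules):
    """Return a copy of f_i in which rules (t, c) have filled incomplete attributes."""
    f = {x: {a: set(v) for a, v in row.items()} for x, row in f_i.items()}

    def matches(t, x):
        # Assumption: x is in M_i(t) when each (b, w) of t is possible for x.
        return all(w in f[x][b] for b, w in t)

    for a in incomplete_attrs:              # a_1, ..., a_k
        for c in values[a]:                 # every c in V_a
            progress = True
            while progress:                 # repeat while some object still qualifies
                progress = False
                for x in f:
                    if len(f[x][a]) == 1:
                        continue            # value of a is already determined for x
                    if any(head == c and matches(t, x) for t, head in rules):
                        f[x][a] = {c}
                        progress = True
    return f


# Hypothetical local system: the color of x1 and x2 is incomplete.
F_I = {
    "x1": {"size": {"small"}, "color": {"red", "blue"}},
    "x2": {"size": {"large"}, "color": {"red", "blue"}},
    "x3": {"size": {"small"}, "color": {"blue"}},
}
RULES = [([("size", "small")], "red"), ([("size", "large")], "blue")]
CHASED = chase(F_I, ["color"], {"color": ["red", "blue"]}, RULES)
```

Each assignment makes one more value set a singleton, so the inner loop
terminates; the input system is copied rather than modified in place.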

By a standard chase-interpretation $M_i$ of s(i)-queries in a distributed system
$DS$, we mean the standard interpretation $M_i$ of s(i)-queries in a distributed
information system $Chase_i(DS) = (\{S'_j\}_{j \in I}, L)$, where:

- $S'_j = S_j$ if $j \neq i$,

- $S'_j = Chase(S_j)$ if $j = i$.



3 Cooperative knowledge-based system 

In this section, we define a Cooperative Knowledge Based System (CKBS) and 
introduce the notion of its consistency. We also give an example of CKBS. 

Let $\{D_{ki}\}_{k \in K_i}$, $K_i \subseteq I$, be a collection of knowledge bases where $D_{ki}$ was
created at site $k \in I$ for any $k \in K_i$, and $D_i = \bigcup\{D_{ki} : k \in K_i\} \cup R_i$. By $R_i$ we
mean a set of rules $(t, c)$ created by an expert and stored at site i. Additionally,
we assume here that $t$ is an s(i)-term. System $(\{(S_i, D_i)\}_{i \in I}, L)$, introduced in
([6], [7]), is called a cooperative knowledge-based system (CKBS).

Rules $(t_1, w_1) \in D_{ki}$, $(t_2, w_2) \in D_{ni}$ are consistent at Site i if
$At(w_1) \neq At(w_2)$ or $w_1 = w_2$ or $M_i(t_1 * t_2) = \emptyset$. Otherwise, we call them
possibly inconsistent. We say that the knowledge base $D_i$ is consistent at Site i if
any two rules in $D_i$ are consistent at Site i. Similarly, we say that the cooperative
knowledge-based system $DS = (\{(S_i, D_i)\}_{i \in I}, L)$ is consistent if $D_i$ is consistent
at Site i for any $i \in I$.

Figure 1 gives an example of CKBS. Rules in the knowledge base of Site
2 have been computed at another site of CKBS. It can be easily checked that
these rules are consistent at Site 2.

Fig. 1. Site 2 of CKBS

4 Query Language and Its Interpretation. 

In this section we introduce a query language and propose its optimistic
interpretation in a Site(i) of CKBS. A formal system for handling queries in CKBS
will be presented in a separate paper.

The standard chase-interpretation introduced in Section 2 shows how to
interpret s(i)-queries in a Site(i) of CKBS. The question of interpreting DNF
queries built from values of attributes belonging to a superset of $V_i$ in Site(i)
remains open. Such queries are called global for a Site i. Their standard
interpretation at Site i of a cooperative knowledge-based system
$(\{(S_j, D_j)\}_{j \in I}, L)$, where $D_j = \bigcup\{D_{nj} : n \in K_j\}$, $S_i = (X_i, A_i, V_i, f_i)$,
is proposed.

To simplify our notation, we write $S$ instead of $S_i$, we write $w$ instead of an
atomic term $(a, w)$, and assume that $V = V_i = \bigcup\{V_{ia} : a \in A_i\}$ and
$C_S = \bigcup\{V_j : j \in I\} - V$.

Elements in $C_S$ are called concepts at site i.

By a query language $L(S, C_S)$ we mean a sequence $(A, T, F)$, where $A$ is an
alphabet, $T$ is a set of DNF terms (queries), and $F$ is a set of atomic formulas.
The alphabet $A$ of $L(S, C_S)$ contains:

- constants: $w$ where $w \in V_i \cup C_S$

- constants: 0, 1

- functors: $+$, $*$, $\sim$

- predicate: $=$

- auxiliary symbols: (, ).

The set of terms $T$ is a least set such that:

- constants 0, 1 are terms,

- if $w$ is a constant, then $w$, $\sim w$ are terms,

- if $t_1, t_2$ are terms, then $t_1 * t_2$ is a term.

The set of DNF terms is a least set such that:

- if $t$ is a term, then $t$ is a DNF term,

- if $t_1, t_2$ are DNF terms, then $t_1 + t_2$ is a DNF term.

Parentheses are used, if necessary, in the obvious way. As will turn out later,
the order of a sum or product is immaterial. So, we will abbreviate finite sums
and products as $\sum\{t_j : j \in J\}$ and $\prod\{t_j : j \in J\}$, respectively.

The set of atomic formulas $F$ is a least set such that:

- if $t_1, t_2$ are DNF terms, then $(t_1 = t_2)$ is an atomic formula.

Let $M_i$ be a standard chase-interpretation of local s(i)-queries in
$DS = (\{S_j\}_{j \in I}, L)$. By a standard interpretation of DNF queries and atomic formulas
from $L(S, C_S)$ in an S-consistent cooperative knowledge-based system
$(\{(S_j, \{D_{kj}\}_{k \in K_j})\}_{j \in I}, L)$, where $S = (X_i, A_i, V_i, f_i)$ and
$V_i = \bigcup\{V_{ia} : a \in A_i\}$, we mean a partial function $N_i$ from the set of DNF
queries into $X_i$-algebra $(P, \otimes, \oplus, \neg)$ such that:

(1) for any $w \in V_{ia}$,

    $N_i(w) = M_i((a, w))$, $N_i(\sim w) = \neg N_i(w)$

(2) if $w \in C_S$,

    $N_i(w) = \max(\{(x, p) : x \in X_i \;\&\; (\exists n \in K_i)(\exists p > 0)(\exists t)[(t, w) \in D_{ni} \;\&\; (x, p) \in M_i(t)]\})$,

    $N_i(\sim w) = \max(\{(x, p) : x \in X_i \;\&\; (\exists n \in K_i)(\exists p > 0)(\exists t)[(t, \sim w) \in D_{ni} \;\&\; (x, p) \in M_i(t)]\})$,

    where $(x, p) \in \max(D)$ iff $(x, p) \in D \;\&\; \neg(\exists q > p)((x, q) \in D)$

(3) $N_i(0) = N_i(\sim 1) = \emptyset$, $N_i(1) = N_i(\sim 0) = X_i$,

(4) for any terms $t$, $w$

    $N_i(t * w) = N_i(t) \otimes N_i(w)$, $N_i(t * (\sim w)) = N_i(t) \otimes N_i(\sim w)$

(5) for any DNF terms $t_1, t_2$

    $N_i(t_1 + t_2) = N_i(t_1) \cup N_i(t_2)$,

(6) for any DNF terms $t_1, t_2$

    $N_i((t_1 = t_2)) = (\text{if } N_i(t_1) = N_i(t_2) \text{ then } T \text{ else } F)$

    (T stands for True and F for False)

From the point of view of site i, the interpretation $N_i$ represents a pessimistic
approach to query evaluation. If $(x, p)$ belongs to the response of a query $t$, it
means that $x$ satisfies the query $t$ with a confidence not less than $p$.



5 Conclusion 

We have proposed a new query answering system (QAS) for an incomplete
cooperative knowledge-based system (CKBS). The Chase algorithm, based on rules
discovered at remote sites, helps to make the data at the local site more complete
and at the same time improves on the previous QAS (see [6]) for CKBS.

Abstract. Inductive learning systems in a logical framework are prone
to difficulties when dealing with huge amounts of information. In particular,
the learning cost is greatly increased, and it becomes difficult
to find descriptions of concepts in a reasonable time. In this paper, we
present a learning approach based on Rough Set Theory (RST), and more
specifically on its basic notion of concept approximation. In accordance with
RST, a learning process is split into three steps, namely (1) partitioning
of knowledge, (2) approximation of the target concept, and finally
(3) induction of a logical description of this concept. The second step
of approximation reduces the volume of the learning data, by computing
well-chosen portions of the background knowledge which represent
approximations of the concept to learn. Then, only one of these portions
is used during the induction of the description, which allows for
reducing the learning cost. In the first part of this paper, we report how
RST's basic notions, namely indiscernibility as well as lower and upper
approximations of a concept, have been adapted in order to cope with
a logical framework. In the remainder of the paper, some empirical results
obtained with a concrete implementation of the approach, i.e., the
EAGLE system, are given. These results show the relevance of the approach,
in terms of learning cost gain, on a learning problem related to
document understanding.



1 Introduction 

Inductive Logic Programming (ILP) [6] is a symbolic approach to machine learning
within a logical framework. The aim of ILP is to learn logical descriptions of
concepts from their ground examples and counter-examples, as well as an initial
model of the domain called background knowledge. Learning systems such as
FOIL [13], FOCL [10] and PROGOL [7] are based on the paradigm of ILP. Given
a concept to learn, called the target concept, the learning task of ILP is
formalized as a search problem inside the space of all candidate descriptions [5],
also called hypotheses. Some operators, such as generalization and specialization,
allow for exploring the space from a hypothesis to better ones. The relevance
of a hypothesis in comparison with others is estimated with different criteria,
which are most often related to the number of examples and counter-examples
it characterizes, according to the given background knowledge. The search stops
when a hypothesis which characterizes all the examples but no counter-example
of the target concept is found. The accuracy of the learned description is then
estimated through the classification of unseen examples. The cost of an exhaustive
exploration of the search space depends on several parameters, such as the language
used to represent hypotheses, which determines the size of the space, but also
the number of examples and the size of the background knowledge, the search
strategy, and so forth. In practice, this cost is prohibitive, and a lot of research
in the ILP field is devoted to developing learning biases [2, 14, 8], i.e., techniques
for pruning the search space.

In this paper, we propose a learning approach which aims at reducing the
learning cost by limiting the volume of the background knowledge to be used
for the search of the description. To achieve this goal, this approach is based on
Rough Set Theory [9], and more specifically on its notion of concept
approximation. A learning process comprises three steps, namely (1) partitioning of the
knowledge, (2) approximation of the target concept, and finally (3) induction
of a suitable description. In order to decrease the learning cost, the second step
computes well-chosen portions of the background knowledge, which represent
approximations of the concept to learn. Then, only one of these portions is used
during the last induction step. This approach has been implemented in the
EAGLE system, and evaluated on a real-world dataset related to the problem
of document understanding [3]. The empirical results obtained by EAGLE show
the relevance of the approach on this problem, since the reduction of the
background knowledge leads to a reduction of the learning time, without any loss of
accuracy.

In section 2, a more detailed presentation of the ILP framework is given,
together with the principles of the RST-based learning approach. The latter is
then described more precisely in sections 3, 4 and 5. In particular, it is shown
how RST's basic notions, namely indiscernibility as well as the lower and upper
approximations of a concept, have been defined in the context of ILP. Finally,
experiments conducted with the EAGLE system are reported and discussed
in section 6.

2 Introduction of RST within an ILP framework 

In ILP, the language used to represent the knowledge is first-order logic.
Each concept $C$ is basically defined by a boolean predicate $C(x_1, \ldots, x_n)$, where
each variable $x_i$ can be assigned either a nominal value (e.g., an identifier
standing for a domain entity), or a qualitative value (e.g., a constant). Thus, ground
examples of a concept $C$ consist of a collection of n-ary relational tuples of
the form $C(t_1, \ldots, t_n)$, where each function-free term $t_i$ is a possible assignment
for variable $x_i$ (nominal or qualitative). For instance, Height(Owen, Small) and
Height(Garp, High) are examples of the Height(x, y) concept, x and y being
respectively nominal and qualitative variables. Each example is labeled as positive
($\oplus$) or negative ($\ominus$), depending on its membership in the concept.

Given a target concept, a training set composed of its labeled examples, as
well as background knowledge including examples of other concepts, the goal of
ILP can be stated as the induction of a logical description of the target concept,
which defines it in terms of other concepts of the background knowledge. A
description is a set of definite Horn clauses, which come in the following form:

$\underbrace{L_{Target}}_{Head} \leftarrow \underbrace{L_1 \wedge L_2 \wedge \ldots \wedge L_n}_{Body}$

Each $L_i$ is a literal corresponding to a concept $C_i$ which includes variables or
constants. For instance, Height(x, High) $\leftarrow$ Parent(x, y) $\wedge$ Height(y, High) is a
possible description of the Height(x, y) concept. The learned description must
characterize as many positive and as few negative training examples of the target
concept as possible in the given background knowledge.

In accordance with RST, the approach proposes three different steps to
achieve the learning goal. The first one performs a partitioning of the
training set and the background knowledge, according to an indiscernibility relation.
The resulting knowledge consists of a collection of equivalence classes, also called
granules of concepts, grouping together indistinct examples. Then, a second step
computes the lower and upper approximations of the target concept. Both
approximations are defined as portions of the background knowledge, which include
respectively the minimal (lower approximation) and maximal (upper
approximation) amounts of information about the target concept. They are computed from
granules, according to criteria related to the nominal information of the target
concept. Thus, the underlying goal of this approximation step is the reduction
of the data which will be used for inducing the description. Finally, the third
step consists in inducing the searched description from a chosen approximation,
either the lower or the upper one. Thus, this last step is the one which explores the
search space. The use of the resulting description for the classification of
test examples, i.e., examples which did not take part in the learning process,
allows for evaluating its accuracy.

3 Partitioning of knowledge and granularity of concepts 

In the ILP framework of the approach, RST's basic notion of indiscernibility
is an equivalence relation $I$, which is defined on (1) the representation of the
same concept, (2) the equality of the label ($\oplus$ or $\ominus$), and (3) the equality of
constant terms.

Definition 1. Let $T_1 : P(t_1, \ldots, t_p)$ and $T_2 : Q(t'_1, \ldots, t'_q)$ be two examples.
$T_1$ and $T_2$ are indiscernible according to $I$, denoted $(T_1 \; I \; T_2)$, if and only if:

(1) $P = Q$

(2) $Label(T_1) = Label(T_2)$

(3) $\forall t_i \in T_1, \forall t'_j \in T_2$ such that $Constant(t_i)$ and $Constant(t'_j)$: $i = j \Rightarrow t_i = t'_j$

Partitioning the knowledge consists in grouping together into granules examples
which are indiscernible according to $I$, i.e., the information they describe
is insufficient to distinguish them. Granules including positive (resp. negative)
examples are called positive (resp. negative) granules. After this partitioning of
ground examples, the representation of a concept consists of the collection of
positive and negative granules including its examples. The granularity of a concept,
i.e., its number of granules, varies from one concept to another. Indeed, it results
from the above definition of the indiscernibility relation that the examples of
concepts including constant terms are distributed among several granules,
depending on the different possible values for the constant terms. On the contrary,
concepts whose examples contain exclusively nominal terms are represented by
only two granules, including respectively their positive and negative examples.
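
The partitioning step can be sketched as a grouping of examples by a key built
from the three conditions of the indiscernibility relation. The representation
below (a predicate name, a label, and a tuple of terms tagged as "nominal" or
"constant") is an assumption made for illustration, as are the function names
and the examples beyond Height(Owen, Small) and Height(Garp, High); it is not
the data structure of the EAGLE system.

```python
from collections import defaultdict


def granule_key(example):
    """Predicate, label, and the constant terms with their positions."""
    predicate, label, terms = example
    constants = tuple((i, v) for i, (kind, v) in enumerate(terms) if kind == "constant")
    return predicate, label, constants


def partition_into_granules(examples):
    """Group the examples that are indiscernible into the same granule."""
    granules = defaultdict(list)
    for example in examples:
        granules[granule_key(example)].append(example)
    return dict(granules)


def nom(value):
    return ("nominal", value)


def const(value):
    return ("constant", value)


# Hypothetical ground examples: (predicate, label, terms).
EXAMPLES = [
    ("Height", "+", (nom("Owen"), const("Small"))),
    ("Height", "+", (nom("Garp"), const("High"))),
    ("Height", "+", (nom("Jenny"), const("Small"))),
    ("Parent", "+", (nom("Jenny"), nom("Garp"))),
    ("Parent", "-", (nom("Owen"), nom("Garp"))),
]
GRANULES = partition_into_granules(EXAMPLES)
```

Here Height is split according to the values of its constant term, whereas
Parent, which has only nominal terms, yields one positive and one negative
granule.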

4 Approximation of concepts 

In the context of ILP, the lower and upper approximations of a concept are 
well-chosen portions of the background knowledge, which are defined according 
to some criteria based on the nominal information of this concept. 

Definition 2. The nominal information of a concept $C$, denoted $Nom(C)$, is
the set of nominal terms occurring in its positive or negative examples:
$Nom(C) = \{t_i \;/\; Nominal(t_i) \text{ and } \exists C(\ldots, t_i, \ldots) \in [C]^{\oplus} \cup [C]^{\ominus}\}$,
where $[C]^{\oplus}$ (resp. $[C]^{\ominus}$) stands for the set of positive (resp. negative)
granules representing $C$.

Since nominal terms designate domain entities, the nominal information of a con- 
cept thus consists of the set of entities which appear in its representation. Given 
this information, the lower and upper approximations are defined as follows: 

Definition 3. The lower approximation of a concept $C$, denoted $\underline{R}C$, is the set
of positive and negative examples of the background knowledge whose nominal
terms are included in or equal to the nominal information of $C$:
$\underline{R}C = \{T_i \;/\; NT(T_i) \subseteq Nom(C)\}$, where $NT(T_i)$ stands for the set of
nominal terms occurring in example $T_i$.

Definition 4. The upper approximation of a concept $C$, denoted $\overline{R}C$, is the
set of positive and negative examples of the background knowledge whose
nominal terms intersect with the nominal information of $C$:
$\overline{R}C = \{T_i \;/\; NT(T_i) \cap Nom(C) \neq \emptyset\}$.
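
Definitions 2 to 4 can be illustrated with the following sketch. Examples are
assumed to be (predicate, label, terms) triples whose terms are tagged as
"nominal" or "constant"; this representation, the function names and the
background examples are hypothetical and serve only to make the set conditions
concrete.

```python
def nominal_terms(example):
    """NT(T): the set of nominal terms occurring in an example."""
    _, _, terms = example
    return {value for kind, value in terms if kind == "nominal"}


def nominal_information(concept_examples):
    """Nom(C): nominal terms occurring in the examples of the concept."""
    nominal = set()
    for example in concept_examples:
        nominal |= nominal_terms(example)
    return nominal


def lower_approximation(background, nominal):
    """Examples whose nominal terms are all contained in Nom(C)."""
    return [e for e in background if nominal_terms(e) <= nominal]


def upper_approximation(background, nominal):
    """Examples sharing at least one nominal term with Nom(C)."""
    return [e for e in background if nominal_terms(e) & nominal]


# Hypothetical target concept and background knowledge.
TARGET = [
    ("Height", "+", (("nominal", "Owen"), ("constant", "Small"))),
    ("Height", "+", (("nominal", "Garp"), ("constant", "High"))),
]
BACKGROUND = [
    ("Parent", "+", (("nominal", "Owen"), ("nominal", "Garp"))),
    ("Parent", "+", (("nominal", "Jenny"), ("nominal", "Garp"))),
    ("Parent", "+", (("nominal", "Ann"), ("nominal", "Bob"))),
]
NOM = nominal_information(TARGET)
LOWER = lower_approximation(BACKGROUND, NOM)
UPPER = upper_approximation(BACKGROUND, NOM)
```

In this example the first background fact mentions only entities of the target
concept and belongs to both approximations, the second shares one entity and
belongs to the upper approximation only, and the third is discarded.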

In effect, approximating a concept consists in selecting, in each granule of the
background knowledge, the examples which share common information with this
concept. Indeed, the lower (resp. upper) approximation is the portion of the
background knowledge whose nominal information is identical (resp. close) to
that of the concept. The two approximations represent respectively the minimal
and the maximal amounts of data available about the target concept. Indeed, it
follows from the previous definitions that both approximations verify the following
property: $\underline{R}C \subseteq \overline{R}C$, i.e., the lower approximation of a concept is a subset of
its upper approximation. Thus, the approximation step reduces the volume of the
background knowledge which will be used for inducing the searched description. This
reduction does not modify the granularity of the data, but the size of the
granules.

5 Induction of logical descriptions of concepts

The problem of inducing a description of a target concept during the last
induction step can be formalized as follows. Given

• the set of the target concept's positive granules: $[Target]^{\oplus}$,

• the set of the target concept's negative granules: $[Target]^{\ominus}$,

• the target concept's chosen approximation: either the lower one $\underline{R}Target$ or the
upper one $\overline{R}Target$,

find a description, i.e., a set of clauses, which characterizes as many positive
examples (in $[Target]^{\oplus}$) and as few negative ones (in $[Target]^{\ominus}$) of the target
concept as possible.
To address this goal, an inductive learning algorithm which overlaps the one 
of the FOIL system [13] is performed. Since positive training examples may 
be distributed among several granules, consequently to the partitioning of the 
knowledge, several stages are required in order to find a suitable description. 
Indeed, positive examples coming from distinct granules can not be handled at 
the same time, since they are not indiscernible. Thus, in order to respect the 
granularity of the target concept, each stage deals with a single positive granule 
of [Target]® and consists in finding a subset of clauses which characterize all its 
included examples but no negative one. In its initial state, each clause h which 
is searched has an empty body: LTarget ^ — ?• The head literal LTarget is a gen- 
eralized form of the examples included in the current positive granule, obtained 
by replacing nominal terms by distinct variables and constant terms by the cor- 
responding value. Then, the construction of h consists in adding literals one by 
one to its body. Before any addition, the space of candidate literals comprises all 
the possible generalized forms of granules which can be obtained from the chosen 
approximation, by replacing nominal terms with bound variables, i.e., variables 
which already appear in the clause, but also new ones. In order to prevent from 
a large space, each candidate literal must contain at least one bound variable. 
Then, the literal that is chosen among all possible candidates to be added to the 
body of is the one which has the maximal gain^ as it is defined in the FOIL 
system [13]. This heuristic allows for choosing the best literal, according to the 
number of positive and negative training examples which are characterized by 
the resulting clause. The addition of literals to the clause’s body stops when one 
of the three following stopping criteria is verified: (1) the clause in progress is 
complete, i.e., it characterizes no more negative example in the training set, (2) 
all the candidate literals have a negative gain or (3) the number of literals in the 
body is greater than or equal to a maximal number that is fixed by the user. 
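
The literal selection and the second stopping criterion can be illustrated with a minimal Python sketch. It assumes a simplified form of the FOIL gain that counts examples rather than variable bindings; the function names and the literal names used as dictionary keys are hypothetical and are not taken from EAGLE.

```python
import math


def foil_gain(p0: int, n0: int, p1: int, n1: int) -> float:
    """Simplified FOIL gain of adding one literal to a clause.

    p0, n0: positive / negative examples characterized before the addition.
    p1, n1: positive / negative examples characterized after the addition.
    """
    if p0 == 0 or p1 == 0:
        return float("-inf")  # a literal that keeps no positive example is never useful
    info_before = -math.log2(p0 / (p0 + n0))
    info_after = -math.log2(p1 / (p1 + n1))
    return p1 * (info_before - info_after)


def choose_literal(candidates, p0, n0):
    """Return the candidate with maximal gain, or None when every gain is negative.

    candidates maps a (hypothetical) literal name to its (p1, n1) counts.
    """
    best = max(candidates, key=lambda lit: foil_gain(p0, n0, *candidates[lit]), default=None)
    if best is None or foil_gain(p0, n0, *candidates[best]) < 0:
        return None  # stopping criterion (2): all candidate literals have a negative gain
    return best
```

For instance, with 4 positive and 4 negative examples characterized by the clause in progress, a literal that keeps 3 positive examples and excludes every negative one is preferred over a literal that keeps 4 positive and 3 negative examples.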

6 Experimental results 

This section reports an experiment, performed with the EAGLE system, 
which is a concrete implementation of the approach. The chosen real-world 
dataset is related to the problem of document understanding, and can be found 
at the MLnet Archive. Through this experimentation, our purpose is to analyze 
the effects of the background knowledge reduction, i.e., the approximation 
advocated by the approach, on the learning cost, i.e., the time spent to induce 
the descriptions, and the accuracies of the learned descriptions. The learning 
problem consists in discovering rules for identifying the logical components of 
a document (for instance sender, receiver, logo, reference or date), according to 
their layout information. This learning problem has already been the object of 
several studies, such as for example the learning of contextual rules, which can 
be found in [3]. The complete dataset contains the descriptions of 30 single page 
documents, which are letters sent by Olivetti. For each document, a descrip- 
tion consists of the list of its logical components, together with their respective 
layout features, which are expressed by means of various concepts. The labels 
of the logical components are indicated by the five following target concepts, 
namely sender(x), receiver(x), logo(x), ref(x) and date(x). Other concepts of the 
background knowledge, such as width_small(x), height_large(x), type_picture(x), 
position_center(x) and above(x,y), allow for expressing the components layout 
features as well as their relationships. 

The global experimentation on this dataset has been performed under a 6- 
cross validation protocol, which means that 6 experiments have been conducted. 
At each experiment, a sample of labeled components coming from a random 
subset of 20 documents is used as training examples. Other components coming 
from the remaining 10 documents are used as test examples. In both samples, 
only positive examples of each target concept are given. Negative examples are 
derived from positive ones, by considering that for each target concept, the 
components which have a different label are its negative examples. For instance, 
negative examples of senders are components which are labeled as receiver, logo, 
reference or date. 

Each experiment consists in inducing definitions for each target concept, from 
the complete background knowledge, but also from the lower and upper approx- 
imations, which represent respectively (on average on the six experiments and 
for each target concept) 60% and 66.3% of the whole background knowledge. 
For each target concept, the classification accuracy of its induced description is 
computed from the test examples, as the number of examples which match the 
description (i.e., the number of positive examples which are characterized by the 
description, added to the number of negative examples which are not characterized 
by the description), divided by the total number of test examples. The results reported 
on table 1 are averages of the results obtained during the six experiments. These 
latter comprise the classification accuracies of the learned descriptions, as well as 
the learning times (in seconds), according to the portion of background knowl- 
edge used (complete, lower or upper). In case of a learning from the complete 
background knowledge, the learning time corresponds only to the time spent by 
the induction step, since no approximation is computed. In case of a learning 
from lower or upper approximation, it additionally comprises the time spent by 
the approximation step. By comparison, the accuracies obtained by the FOCL 
[10] system for each target concept are also given. 

MLnet Archive: http://www.gmd.de/ml-archive/frames/datasets/datasets-frames.html 

Fig. 1. Experimental results: (a) EAGLE, (b) FOCL 
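
The accuracy measure described above can be written as a short Python sketch. The function and parameter names are illustrative assumptions; only the formula itself comes from the text.

```python
def classification_accuracy(covered_pos: int, total_pos: int,
                            covered_neg: int, total_neg: int) -> float:
    """Fraction of test examples that match the induced description.

    A positive example matches when the description characterizes it;
    a negative example matches when the description does not characterize it.
    """
    matching = covered_pos + (total_neg - covered_neg)
    return matching / (total_pos + total_neg)
```

With 10 positive and 40 negative test examples, a description that characterizes 8 of the positive examples and 1 of the negative ones matches 8 + 39 = 47 examples out of 50.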



A first observation is that the accuracies obtained by EAGLE are identical, 
whatever the portion of the background knowledge (lower, upper, complete) used 
for learning. They are also better than the ones obtained by 
FOCL. It appears from the table that learning from the lower approximation is 
more interesting than using the whole background knowledge, since it reduces 
the learning time without any loss of accuracy. Actually, since an approximation 
is only a portion of the background knowledge, the time spent by the system for 
inducing descriptions is reduced. In particular, the evaluation of each candidate 
literal's gain is faster because there are fewer possible bindings for the variables. 
The time used for computing an approximation, which is on average equal to 
0.38 seconds, appears to be insignificant by comparison with the time which is 
further saved during the inductive step. This phenomenon is observable on each 
target concept; however, it is less obvious for Logo(x). Indeed, for this concept 
the learning goal is quite easy to achieve, i.e., EAGLE finds a description in a 
short time, at each experiment, from any portion of the background knowledge. 
This description is the following: Logo(x) ← Type_picture(x). As a consequence, the 
approximation step consumes time without significantly decreasing the time of 
induction. In that case, the search can straightforwardly be performed on the 
complete background knowledge. On the contrary, in case of concepts such as 
Sender(x), Receiver(x), Ref(x) and Date(x), the learning goal is more difficult to 
achieve, since a description with several clauses (5 on average) including more 
literals in their bodies (4 on average) is searched. In that case, computing an 
approximation is interesting since it reduces significantly the learning time. For 
example, on concept Receiver, the gain of time reaches approximately 56%. 

7 Conclusion 

In this paper, we have pointed out the problem of handling large amounts of 
data faced by inductive learning systems, especially those which fit into a logical 
framework. In order to cope with this problem, we have proposed a learning 
approach, which introduces and adapts the basic notions of Rough Set Theory 
within the paradigm of Inductive Logic Programming. In particular, it argues in 
favor of performing induction from only well-chosen portions of the background 
knowledge, which represent approximations of the concept to learn. Empirical 
results obtained with a first concrete implementation of the proposed approach 
show that the approximation allows for reducing the learning cost without degrading 
the accuracy of the learned descriptions. The current direction of research aims at improving 
the flexibility and the adaptability of the approach to various learning problems, 
and especially various forms of data. In particular, an extension of this approach, 
which provides a suitable framework for handling uncertain data and inducing 
flexible concepts, has already been proposed and evaluated in [11,12,4]. 

Abstract. In this paper we outline the design of a RDBMS that will provide 
the user with traditional query capabilities as well as KDD queries. 
Our approach is not just another system which adds KDD capabilities; 
this design is aimed to integrate these KDD capabilities into the RDBMS 
core. The approach also defines a generic engine of Data Mining algorithms 
that allows easy enhancement of system capabilities as a new 
algorithm is implemented. 



1 Introduction 

Most of the KDD systems that have been implemented up to the present moment 
apply just one particular methodology or implement a particular algorithm 
(rough sets [9], attribute-induction [5], a priori [1,2]). When designing this architecture 
we wanted a system that integrates data mining capabilities within the 
RDBMS. We wanted the system to be extensible, that is, we wanted to build 
a system in which adding new algorithms would be easy. This goal is achieved by 
dividing KDD algorithms into basic operations that will be implemented as particular 
instances of a structure that has been called operators. 

The paper is organized as follows: The division of KDD algorithms into basic 
operations is explained in section 2 as well as the main structure that operators 
must have in order to be included in the system. Extension of main modules of 
traditional database systems to handle new operations is discussed in section 3. 

2 Algorithms 

It is easy to observe that many KDD algorithms have similar behavior during, 
at least, an important part of their execution. This fact has led us to consider 
the division of the algorithms into several parts, in order to achieve reusability 
of code. Moreover, as one of the goals of the design is the integration with a 
RDBMS, we have tried to make each of those parts as similar to RDBMS basic 
operations as possible. In our design the basic operations will be performed by 
the operator structure. We will call operator any operation that is made up of: 
Relational tables both as Input and Output, Auxiliary Structures that will 
be used to keep input/output information of the operation and Parameters 
that guide their behavior. The main result of this process could be represented 
as Extracted Information. In figure 1 the basic structure of an operator is 
depicted. A data mining query will then be defined as the sequence of operators 
that, given a particular table containing certain objects, gets as a result the set 
of patterns to describe the knowledge asked by such a query. 

[Figure: an operator with its Input Objects, Operator Parameters and Extracted Information] 

Fig. 1. Operator interface 
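
The operator structure and the notion of a data mining query as a sequence of operators can be sketched in Python as follows. This is only an illustration under assumptions: the class, the function names, the use of a list of dictionaries as a relational table and of a shared dictionary as auxiliary structure are all hypothetical and do not describe the actual RDBMAS design.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Table = List[Dict[str, Any]]  # assumption: a relational table as a list of rows


@dataclass
class Operator:
    """One basic operation: table in, table out, guided by parameters."""
    name: str
    run: Callable[[Table, Dict[str, Any], Dict[str, Any]], Table]
    parameters: Dict[str, Any] = field(default_factory=dict)


def execute_query(operators: List[Operator], table: Table):
    """Run a data mining query, i.e. a sequence of operators, on an input table."""
    aux: Dict[str, Any] = {}  # auxiliary structures shared between operators
    for op in operators:
        table = op.run(table, op.parameters, aux)
    return table, aux  # the last table plays the role of the extracted information


def count_values(table, params, aux):
    """Hypothetical operator: count the occurrences of each value of one column."""
    counts: Dict[Any, int] = {}
    for row in table:
        counts[row[params["column"]]] = counts.get(row[params["column"]], 0) + 1
    aux["counts"] = counts
    return table


def keep_frequent(table, params, aux):
    """Hypothetical operator: turn sufficiently frequent values into patterns."""
    return [{"value": v, "count": n}
            for v, n in sorted(aux["counts"].items()) if n >= params["min_count"]]
```

A query is then an ordinary list such as `[Operator("count", count_values, {"column": "item"}), Operator("filter", keep_frequent, {"min_count": 2})]`, so adding a new algorithm amounts to supplying new operator functions.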

3 Architecture 

The decomposition of algorithms into operators guided the design of the system 
we plan to build. The identification of these simple components could be used to 
split most KDD queries into atomic elements that could be managed directly by a 
new RDBMS. This system may be named as RDBMAS (Relational DataBase 
Management and Analysis System) and would be designed (see figure 2) as an 
extension of traditional RDBMS with new capabilities. 

In figure 2 we show how a generic traditional RDBMS would have to be modified. 
As a result next generation of database systems would provide the user with 
traditional queries as well as the possibility to analyze data. 

- Query Analyzer: This module parses each query submitted to the system 
and translates it into an internal representation. In a traditional RDBMS this 
module analyzes only SQL commands; for RDBMAS it will also have to analyze 
new KDD sentences. 

- Optimizer: This component takes the internal sentence representation supplied 
by the previous module and optimizes the order of execution of its clauses. 
As a result, it returns a specific execution plan for the user query. If new operations 
provide the optimizer with the same measures traditional operations 
do (weights, restrictions, ...), this module would not have to be changed 
significantly. 

- Engine: This module carries out the execution plan. In RDBMAS this execution 
engine must be able to complete SQL sentences and new KDD queries. 

- Loader: In order to achieve KDD queries some new code must be executed 
in the RDBMAS. Almost all KDD systems provide their functionalities with 
statically defined code. In this system all algorithms must be split into elemental 
operations and may be implemented in external dynamically-loaded components 
(operators). 

- Catalog: Information about data (metadata) is stored as additional tables 
for RDBMS. This information is required for management functions, but 
new information about data may also be necessary in order to support analysis 
(KDD capabilities) functions. So, the system catalog must be extended. 

Fig. 2. Traditional RDBMS and proposed RDBMAS 



3.1 Comparison with other systems 

Approaches like Data Surveyor [6] propose similar system architectures that 
modify RDBMS in order to achieve better performance, but in most cases KDD 
algorithms run outside of these enhanced RDBMS. With our design all queries 
(SQL queries and KDD queries) are managed by the same RDBMAS. 

RSDM [4] has been conceived as an engine of KDD algorithms instead of a sys- 
tem that adds some particular capabilities. This approach has its advantages 
as well as disadvantages. On the one hand, the idea of building an engine of 
algorithms, in contrast to all the existing Data Mining systems, will allow 
new capabilities to be added with the only task of building the module that will execute 
such a capability. This avoids the complex process of codifying programs for an 
integrated system in which you have to take care not only of the coding of the algorithm 
but also of the communication, storing of intermediate results and so on. On 
the other hand, the process of construction of the architecture is more complex. 
However, we must emphasize once again that adding any capability will be a 
straightforward task once the architecture has been finished. 

3.2 Conclusion and future research 

The design of a new generation of database systems that will provide the user 
with query and analysis of data has been outlined in this paper. We will call 
this system RDBMAS. 

At the present moment the design is being further studied to tackle the problems 
that are arising as a result of adding the new capabilities. Also, implementation 
of a first prototype has just begun. We hope to have a prototype available 
in the near future. 

We have to remark once again that the benefit of implementing the operations inside 
the core of the RDBMS is twofold: on the one hand, an efficiency gain due to maximization 
of optimizer functions; on the other, the next generation of RDBMS will 
provide users with data analysis capabilities. 

Acknowledgments 

We are very much indebted for inspiration to Dr. Ziarko and Dr. Pawlak. 



References 

1. R. Agrawal, T. Imielinski, A. Swami, Mining Association Rules Between Sets of Items 
in Large Databases, Proceedings of ACM SIGMOD, pp. 207-216, May 1993. 

2. R. Agrawal, Mining Association Rules Between Sets of Items in Large Databases, 
In Proceedings of ACM SIGMOD Int. Conf. on Management of Data, Washington 
DC, pp. 207-216, 1993. 

3. R. Agrawal et al., The Quest Data Mining System, In Proceedings of the Second Int. 
Conf. on Knowledge Discovery and Data Mining, pp. 244-249, August 1996. 

4. M. Fernandez-Baizan, E. Menasalvas, J.M. Pena, Integrating RDBMS and Rough 
Set Theory, to appear in Fuzzy Databases, August 1998. 

5. J. Han et al., DBMiner: A System for Mining Knowledge in Large Relational 
Databases, In Proceedings of the Second Int. Conf. on Knowledge Discovery and 
Data Mining, pp. 250-255, August 1996. 

6. M.L. Kersten, A.P.J.M. Siebes, Data Surveyor: Searching the Nuggets in Parallel, 
Advances in Knowledge Discovery and Data Mining, AAAI Press, pp. 447-467. 

7. J. Komorowski, A. Ohrn, ROSETTA: A Rough Set Toolkit for Analysis of Data, In 
Proceedings JCIS 97, pp. 403-407, March 1997. 

8. Z. Pawlak, Rough Sets - Theoretical Aspects of Reasoning about Data, Kluwer, 1991. 

9. Z. Pawlak, Information Systems - Theoretical Foundations, Information Systems, 6, 
No. 4, 1993, pp. 299-297. 

10. G. Piatetsky-Shapiro, An Overview of Knowledge Discovery in Databases: Recent 
Progress and Challenges, Rough Sets, Fuzzy Sets and Knowledge Discovery, 1994, 
pp. 1-11. 

11. A. Skowron, C. Rauszer, The Discernibility Matrices and Functions in Information 
Systems, ICS PAS Report 1/91, Technical University of Warsaw, 1991, pp. 1-44. 

12. W. Ziarko, Variable Precision Rough Sets Model, Journal of Computer and System 
Sciences, vol. 46, 1993, pp. 39-59. 

13. W. Ziarko, N. Shan, On Discovery of Attribute Interactions and Domain Classifications, 
CSC '95 23rd Annual Computer Science Conf. on Rough Sets and Data 
Mining. 




Fast Discovery of Representative Association Rules 



Marzena Kryszkiewicz 

Institute of Computer Science, Warsaw University of Technology 
Nowowiejska 15/19, 00-665 Warsaw, Poland 
mkr@ii.pw.edu.pl 



Abstract. Discovering association rules among items in a large database is an 
important database mining problem. The number of association rules may be 
huge. To alleviate this problem, we introduced in [1] a notion of representative 
association rules. Representative association rules are a least set of rules that 
covers all association rules satisfying certain user specified constraints. The 
association rules, which are not representative ones, may be generated by means 
of a cover operator without accessing a database. In this paper, we investigate 
properties of representative association rules and offer a new efficient algorithm 
computing such rules. 



1 Introduction 

Discovering association rules among items in large databases is recognized as an 
important database mining problem. The problem was introduced in [2] for sales 
transaction database. The association rules identify sets of items that are purchased 
together with other sets of items. For example, an association rule may state that 90% 
of customers who buy butter and bread buy also milk. Several extensions of the notion 
of an association rule were offered in the literature (see e.g. [3-4]). One of such 
extensions is a generalized rule that can be discovered from a taxonomic database [3]. 
Applications for association rules range from decision support to telecommunications 
alarm diagnosis and prediction [5-6]. 

The number of association rules is usually huge. A user should not be presented 
with all of them, but rather with those which are original, novel, interesting. Several 
definitions of what an interesting association rule is have been proposed (see e.g. 
[3,7]). In particular, pruning out uninteresting rules which exploits the information in 
taxonomies seems to be quite useful (resulting in the rule number reduction amounting 
to 60% [3]). The interestingness of a rule is usually expressed by some quantitative 
measure. In [1] we offered a different approach. We did not introduce any measure 
defining interestingness of a rule, but we showed how to derive the set of association 
rules from a given association rule by means of a cover operator without accessing a 
database. A least set of association rules that allows all other rules satisfying 
user specified constraints to be deduced is called a set of representative association rules. In [1], 
the GenAllRepresentatives algorithm computing representative 
association rules was offered. To check whether a candidate rule is representative the algorithm 
required comparing the rule with longer representative rules, which was quite a time-consuming 
operation. In this paper, we investigate some properties of representative 
association rules that allow us to propose a new efficient algorithm for representative 
association rules mining. The new algorithm generates representative rules 
independently from other representative rules. 

Copyright Springer-Verlag Berlin Heidelberg 1998 



2 Association Rules 

The definition of a class of regularities called association rules and the problem of 
their discovering were introduced in [2]. Here, we describe this problem after [2,8]. 
Let I = {i_1, i_2, ..., i_m} be a set of distinct literals, called items. In general, any set of 
items is called an itemset. Let D be a set of transactions, where each transaction T is a 
set of items such that T ⊆ I. An association rule is an expression of the form X ⇒ Y, 
where ∅ ≠ X, Y ⊂ I and X ∩ Y = ∅. X is called the antecedent and Y is called the 
consequent of the rule. 

Statistical significance of an itemset X is called support and is denoted by sup(X). 
sup(X) is defined as the number of transactions in D that contain X. Statistical 
significance (support) of a rule X ⇒ Y is denoted by sup(X ⇒ Y) and defined as 
sup(X ∪ Y). Additionally, an association rule is characterized by confidence, which 
expresses its strength. The confidence of an association rule X ⇒ Y is denoted by 
conf(X ⇒ Y) and defined as the ratio sup(X ∪ Y) / sup(X). 
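
The two measures can be illustrated with a minimal Python sketch. The toy database below is an assumption made for the illustration only; it is not data from the paper.

```python
def sup(itemset, transactions):
    """Support: number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def conf(antecedent, consequent, transactions):
    """Confidence of antecedent => consequent: sup(X u Y) / sup(X)."""
    return sup(antecedent | consequent, transactions) / sup(antecedent, transactions)


# Hypothetical sales transactions, each one a set of purchased items.
toy_db = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]
```

Here the rule {bread, butter} ⇒ {milk} has support 2 (two transactions contain all three items) and confidence 2/3, because three transactions contain both bread and butter.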

The problem of mining association rules is to generate all rules that have support 
greater than some user specified minimum support s > 0 and confidence not less than a 
user specified minimum confidence c > 0. In the sequel, the set of all association rules 
whose support is greater than s and confidence is not less than c will be denoted by 
AR(s,c). If s and c are understood then AR(s,c) will be denoted by AR. 

In the paper, we apply also the following simple notions: 

The number of items in an itemset will be called the length of the itemset. An 
itemset of the length k will be referred to as a k-itemset. Similarly, the length of an 
association rule X ⇒ Y will be defined as the total number of items in the rule's 
antecedent and consequent (|X ∪ Y|). An association rule of the length k will be 
referred to as a k-rule. An association k-rule will be called shorter than, longer than or 
of the same length as an association m-rule if k < m, k > m, or k = m, respectively. 



3 Cover Operator 

A notion of a cover operator was introduced in [1] for deriving a set of association 
rules from a given association rule without accessing a database. 

The cover C of the rule X ⇒ Y, Y ≠ ∅, is defined as follows: 

C(X ⇒ Y) = {X ∪ Z ⇒ V | Z, V ⊆ Y and Z ∩ V = ∅ and V ≠ ∅}. 

Each rule in C(X ⇒ Y) consists of a subset of items occurring in the rule X ⇒ Y. 
The antecedent of any rule r covered by X ⇒ Y contains X and perhaps some items 
from Y, whereas r's consequent is a non-empty subset of the remaining items in Y. It 
was proved in [1] that each rule r in the cover C(r'), where r' is an association rule 
having support s and confidence c, belongs in AR(s,c). Hence, if r belongs in AR(s,c) 
then every rule r' in C(r) also belongs in AR(s,c). The number of different rules in the 
cover of the association rule X ⇒ Y is equal to 3^m - 2^m, where m = |Y| (see [1]). 
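
The definition of the cover can be checked by direct enumeration. The Python sketch below is an illustration, not the paper's implementation; the function name and the representation of a rule as a pair of frozensets are assumptions. Each item of Y is either moved to the antecedent (Z), kept in the consequent (V), or dropped, which is where the count 3^m - 2^m comes from.

```python
from itertools import product


def cover(x: frozenset, y: frozenset) -> set:
    """Enumerate C(X => Y) as a set of (antecedent, consequent) pairs."""
    items = sorted(y)
    rules = set()
    # 0: item dropped, 1: item moved to the antecedent (Z), 2: item kept in the consequent (V)
    for assignment in product((0, 1, 2), repeat=len(items)):
        z = frozenset(i for i, a in zip(items, assignment) if a == 1)
        v = frozenset(i for i, a in zip(items, assignment) if a == 2)
        if v:  # the consequent must be non-empty
            rules.add((x | z, v))
    return rules


rules_of_r = cover(frozenset("AB"), frozenset("CDE"))
```

For the rule AB ⇒ CDE, where m = 3, the enumeration yields 3^3 - 2^3 = 19 rules, among them the rule itself and ABCD ⇒ E.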

Example 3.1 

Let T1 = {A,B,C,D,E}, T2 = {A,B,C,D,E,F}, T3 = {A,B,C,D,E,H,I}, T4 = {A,B,E} and 
T5 = {B,C,D,E,H,I} be the only transactions in the database D. Let r: (AB ⇒ CDE). 
Fig. 1 contains all rules belonging in the cover C(r) along with their support and 
confidence in D. The support of r is equal to 3 and its confidence is equal to 75%. The 
support and confidence of all other rules in C(r) are not less than the support and 
confidence of r. 



  #    Rule r' in C(r)    Support of r'    Confidence of r' 
  1.   AB ⇒ CDE           3                75% 
  2.   AB ⇒ CD            3                75% 
  3.   AB ⇒ CE            3                75% 
  4.   AB ⇒ DE            3                75% 
  5.   AB ⇒ C             3                75% 
  6.   AB ⇒ D             3                75% 
  7.   AB ⇒ E             4                100% 
  8.   ABC ⇒ DE           3                100% 
  9.   ABC ⇒ D            3                100% 
 10.   ABC ⇒ E            3                100% 
 11.   ABD ⇒ CE           3                100% 
 12.   ABD ⇒ C            3                100% 
 13.   ABD ⇒ E            3                100% 
 14.   ABE ⇒ CD           3                75% 
 15.   ABE ⇒ C            3                75% 
 16.   ABE ⇒ D            3                75% 
 17.   ABCD ⇒ E           3                100% 
 18.   ABCE ⇒ D           3                100% 
 19.   ABDE ⇒ C           3                100% 

Fig. 1. The cover of the association rule r: (AB ⇒ CDE) 



Below, we present two simple properties, which will be used further in the paper. 

Property 3.1 

Let r: (X ⇒ Y) and r': (X' ⇒ Y') be association rules. Then: 

r ∈ C(r') iff X ∪ Y ⊆ X' ∪ Y' and X ⊇ X'. 

Property 3.2 

(i) If an association rule r is longer than an association rule r' then r ∉ C(r'). 

(ii) If an association rule r: (X ⇒ Y) is shorter than an association rule r': (X' ⇒ Y') 
then r ∈ C(r') iff X ∪ Y ⊂ X' ∪ Y' and X ⊇ X'. 

(iii) If r: (X ⇒ Y) and r': (X' ⇒ Y') are different association rules of the same length 
then r ∈ C(r') iff X ∪ Y = X' ∪ Y' and X ⊃ X'. 
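
Property 3.1 gives a membership test that needs no enumeration of the cover. A minimal Python sketch, assuming rules are represented as pairs of frozensets with a non-empty consequent disjoint from the antecedent (the function name is illustrative):

```python
def in_cover(r, r_prime) -> bool:
    """Property 3.1: r is in C(r') iff X u Y is a subset of X' u Y' and X contains X'."""
    (x, y), (x_p, y_p) = r, r_prime
    return (x | y) <= (x_p | y_p) and x >= x_p


r_prime = (frozenset("AB"), frozenset("CDE"))
```

With r' = (AB ⇒ CDE), the rule ABE ⇒ C passes the test, whereas B ⇒ CDE fails it because its antecedent does not contain AB, and A ⇒ BCDE fails it for the same reason.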



4 Representative Association Rules 

In this section we describe a notion of representative association rules which was 
introduced in [1]. Informally speaking, a set of all representative association rules is a 
least set of rules that covers all association rules by means of the cover operator. 

A set of representative association rules wrt. minimum support s and minimum 
confidence c will be denoted by RR(s,c) and defined as follows: 

RR(s,c) = {r ∈ AR(s,c) | ¬∃r' ∈ AR(s,c), r' ≠ r and r ∈ C(r')}. 

If s and c are understood then RR(s,c) will be denoted by RR. Each rule in RR is 
called a representative association rule. By the definition of RR no representative 
association rule may belong in the cover of another association rule. 

Example 4.1 

Given minimum support s = 3 and minimum confidence c = 75%, the following 
representative rules RR(s,c) would be found for the database D from Example 3.1: 

{A ⇒ BCDE, C ⇒ ABDE, D ⇒ ABCE, B ⇒ CDE, E ⇒ BCD, B ⇒ AE, E ⇒ AB}. 

There are 7 representative association rules in RR(s,c), whereas the number of all 
association rules in AR(s,c) is 165. Hence, the representative association rules 
constitute 4.24% of all association rules. 
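
The figures of this example can be reproduced by brute force directly from the definitions of AR and RR. The Python sketch below is an illustration under two assumptions that should be read with care: the transactions are those of Example 3.1 as reconstructed here, and the support test is written as sup > 2, i.e. rules supported by at least 3 transactions qualify, which is the reading under which the counts 165 and 7 are obtained.

```python
from fractions import Fraction
from itertools import product

# Transactions T1..T5 of Example 3.1 (assumed reading of the example).
D = [set("ABCDE"), set("ABCDEF"), set("ABCDEHI"), set("ABE"), set("BCDEHI")]
S, C = 2, Fraction(3, 4)  # assumption: support must exceed 2, confidence at least 75%


def sup(itemset):
    return sum(1 for t in D if itemset <= t)


def all_rules():
    """Every X => Y with X, Y non-empty and disjoint, over the items of D."""
    items = sorted(set().union(*D))
    for assignment in product((0, 1, 2), repeat=len(items)):
        x = frozenset(i for i, a in zip(items, assignment) if a == 1)
        y = frozenset(i for i, a in zip(items, assignment) if a == 2)
        if x and y:
            yield x, y


def in_cover(r, q):
    """Property 3.1."""
    return (r[0] | r[1]) <= (q[0] | q[1]) and r[0] >= q[0]


def show(rule):
    return "".join(sorted(rule[0])) + "=>" + "".join(sorted(rule[1]))


AR = [(x, y) for x, y in all_rules()
      if sup(x | y) > S and Fraction(sup(x | y), sup(x)) >= C]
# A rule is representative when it lies in the cover of no other association rule.
RR = [r for r in AR if not any(r != q and in_cover(r, q) for q in AR)]
```

Under these assumptions `len(AR)` is 165 and `RR` contains exactly the 7 rules listed in the example. The quadratic comparison over AR is only practical for such a small database, which is the cost the algorithm of Section 5 is designed to avoid.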

We may expect that a user will often request the set of representative association 
rules RR rather than the set of all association rules AR. If RR is provided then the user 
may formulate queries about the association rules represented by RR. Clearly, AR(s,c) 
= ∪{C(r) | r ∈ RR(s,c)}. However, we expect the user to ask rather about the covers of 
specific representative rules. The queries might contain not only the cover operator, 
but also the set-theoretical operators of union, difference and intersection. 



5 The Algorithm 

The problem of generating association rules is usually decomposed into two 
subproblems: 

1. Generate all itemsets whose support exceeds the minimum support s. The itemsets 
of this property are called frequent (large). 

2. From each frequent itemset generate association rules whose confidence is not less 
than the minimum confidence c. Let Z be a frequent itemset and ∅ ≠ X ⊂ Z. Then 
any rule X ⇒ Z\X holds if sup(Z)/sup(X) ≥ c. 

In the paper we restrict the second subproblem to generation of all representative 
association rules whose confidence is not less than the minimum confidence c. 

Several efficient solutions were proposed to solve the first subproblem (see 
[3,8-9]). We will briefly recall the main idea of the Apriori algorithm [8] computing 
frequent itemsets. Then, we will propose a new efficient algorithm computing 
representative association rules from the found frequent itemsets. 



5.1 Computing Frequent Itemsets 

The Apriori algorithm exploits the following properties of frequent and non-frequent 
itemsets: All subsets of a frequent itemset are frequent and all supersets of a non-frequent 
itemset are non-frequent. The following notation is used in the Apriori 
algorithm: 

• C_k - set of candidate k-itemsets; 

• F_k - set of frequent k-itemsets. 

The items in itemsets are assumed to be ordered lexicographically. Associated with 
each itemset is a count field to store the support for this itemset. 

Algorithm Apriori 

  F_1 = {frequent 1-itemsets}; 
  for (k = 2; F_{k-1} ≠ ∅; k++) do begin 
    C_k = AprioriGen(F_{k-1}); 
    forall transactions T ∈ D do 
      forall candidates Z ∈ C_k do 
        if Z ⊆ T then 
          Z.count++; 
    F_k = {Z ∈ C_k | Z.count > s}; 
  endfor; 
  return ∪_k F_k; 

First, the support of all 1-itemsets is determined during one pass over the database 
D. All non-frequent 1-itemsets are discarded. Then the loop "for" starts. In general, 
some k-th iteration of the loop consists of the following operations: 

1. AprioriGen is called to generate the candidate k-itemsets C_k from the frequent 
(k-1)-itemsets F_{k-1}. 

2. Supports for the candidate k-itemsets are determined by a pass over the database. 

3. The candidate k-itemsets that do not exceed the minimum support are discarded; 
the remaining k-itemsets are found frequent. 

function AprioriGen(frequent (k-1)-itemsets F_{k-1}); 
  insert into C_k 
    select (Z[1], Z[2], ..., Z[k-1], Y[k-1]) from F_{k-1} Z, F_{k-1} Y 
    where Z[1] = Y[1] ∧ ... ∧ Z[k-2] = Y[k-2] ∧ Z[k-1] < Y[k-1]; 
  delete all itemsets Z ∈ C_k such that some (k-1)-subset of Z 
    is not in F_{k-1}; 
  return C_k; 

The AprioriGen function constructs candidate k-itemsets as supersets of frequent 
(k-1)-itemsets. This restriction of extending only frequent (k-1)-itemsets is justified 
since any k-itemset, which would be created as a result of extending a non-frequent 
(k-1)-itemset, would not be frequent either. The last operation in the AprioriGen 
function prunes the candidates from C_k that do not have all their (k-1)-subsets in the 
frequent (k-1)-itemsets F_{k-1}. If a k-itemset Z does not have all its (k-1)-subsets in F_{k-1} 
then there is some non-frequent (k-1)-itemset which is a subset of Z. This 
means that Z is non-frequent as a superset of a non-frequent itemset. 
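
The two routines above can be sketched in Python as follows. This is an illustrative rendering, not the original implementation: itemsets are represented as lexicographically sorted tuples, supports are counted by scanning an in-memory list of transactions, and the sample database and the threshold s = 2 are assumptions taken from the reading of Example 3.1 used in this text.

```python
from itertools import combinations


def apriori_gen(prev):
    """Candidate k-itemsets from frequent (k-1)-itemsets given as sorted tuples."""
    prev = set(prev)
    # Join step: two itemsets sharing their first k-2 items, ordered on the last one.
    candidates = {z + (y[-1],) for z in prev for y in prev
                  if z[:-1] == y[:-1] and z[-1] < y[-1]}
    # Prune step: drop candidates having a (k-1)-subset that is not frequent.
    return {c for c in candidates
            if all(sub in prev for sub in combinations(c, len(c) - 1))}


def apriori(transactions, s):
    """Map every itemset whose support exceeds s to its support."""
    def count(z):
        return sum(1 for t in transactions if set(z) <= t)

    frequent = {}
    level = {(item,) for item in set().union(*transactions)}  # candidate 1-itemsets
    while level:
        kept = {z: n for z, n in ((z, count(z)) for z in level) if n > s}
        frequent.update(kept)
        level = apriori_gen(kept)  # candidates of the next length
    return frequent


# Assumed reading of the database of Example 3.1.
D = [set("ABCDE"), set("ABCDEF"), set("ABCDEHI"), set("ABE"), set("BCDEHI")]
frequent = apriori(D, 2)
```

On this database every non-empty subset of {A,B,C,D,E} is frequent, which gives 31 frequent itemsets; F, H and I are discarded after the first pass, so no candidate containing them is ever counted.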



5.2 Computing Representative Association Rules 

In this subsection we offer an efficient algorithm for computing representative 
association rules. Unlike the GenAllRepresentatives algorithm proposed in [1], the 
new FastGenAllRepresentatives algorithm exploits solely the information about the 
supports of frequent itemsets. The Apriori algorithm may be run to calculate all 
frequent itemsets and their supports. FastGenAllRepresentatives is based on Properties 
5.2.1-5.2.2, which we present and prove below. 

Lemma 5.2.1 

Let ∅ ≠ X ⊂ Z and r be an expression of the form (X ⇒ Z\X). 

∃Z' ⊆ I, Z' ⊃ Z and sup(Z') > s and sup(Z')/sup(X) ≥ c iff 
∃r' ∈ AR(s,c), r' is longer than r and r ∈ C(r'). 

Proof: 

(⇒) Let r' be an expression of the form: X ⇒ Z'\X. Clearly, r' is longer than r 
since Z' ⊃ Z. The rule r' ∈ AR(s,c) because X ≠ ∅, the support sup(r') = sup(Z') > s and 
the confidence conf(r') = sup(Z')/sup(X) ≥ c. Additionally, Property 3.1 allows us to 
conclude that r ∈ C(r'). 

(⇐) Let Z' ⊆ I and r' be an expression of the form: X ⇒ Z'\X. By the assumption, 
r' ∈ AR(s,c). Hence, sup(r') = sup(Z') > s and conf(r') = sup(Z')/sup(X) ≥ c. 
Additionally, r' is longer than r and r ∈ C(r'). Therefore, we can conclude from 
Property 3.2.ii that Z' ⊃ Z. 

Lemma 5.2.2 

Let ∅ ≠ X ⊂ Z and r be an expression of the form (X ⇒ Z\X). Let maxSup = 
max({sup(Z') | Z ⊂ Z' ⊆ I} ∪ {0}). 

maxSup > s and maxSup/sup(X) ≥ c iff ∃r' ∈ AR(s,c), r' is longer than r and r ∈ C(r'). 

Proof: Lemma 5.2.2 follows immediately from Lemma 5.2.1. 

Property 5.2.1 

Let ∅ ≠ X ⊂ Z ⊆ I and r be a rule: (X ⇒ Z\X) ∈ AR(s,c). The rule r belongs in RR(s,c) 
if the two following conditions are satisfied: 

(i) maxSup ≤ s or maxSup/sup(X) < c, where 
maxSup = max({sup(Z') | Z ⊂ Z' ⊆ I} ∪ {0}), 

(ii) ¬∃X', ∅ ≠ X' ⊂ X, such that (X' ⇒ Z\X') ∈ AR(s,c). 

Proof: Property 3.2.i tells us that an association rule does not belong in the cover of 
any shorter rule. So, the association rule r is representative if it does not belong in the 
cover of any association rule different from r which is longer than r or which is of the 
same length as r. The first condition (i) guarantees that the rule r does not belong in 
the cover of any association rule longer than r (see Lemma 5.2.2). The second 
condition (ii) ensures that the rule r does not belong in the cover of any association 
rule of the same length as r (see Property 3.2.iii). 

Property 5.2.2 

Let ∅ ≠ Z ⊂ Z' ⊆ I. If sup(Z) = sup(Z') then no rule (X ⇒ Z\X) ∈ AR(s,c), where 
∅ ≠ X ⊂ Z, belongs in RR(s,c). 

Proof: Let (X ⇒ Z\X) ∈ AR(s,c). Then, ∅ ≠ X, sup(Z) > s and conf(X ⇒ Z\X) ≥ c. 
Now, let us consider a rule: X ⇒ Z'\X. (X ⇒ Z'\X) ∈ AR(s,c) because ∅ ≠ X, 
sup(Z') = sup(Z) > s and conf(X ⇒ Z'\X) = sup(Z')/sup(X) = conf(X ⇒ Z\X) ≥ c. 
Additionally, (X ⇒ Z\X) ∈ C(X ⇒ Z'\X). Hence, X ⇒ Z\X is not representative. 

procedure FastGenAllRepresentatives(frequent itemsets F);
forall Z ∈ F do begin
  k = |Z|; maxSup = max({sup(Z') | Z ⊂ Z' ∈ F_{k+1}} ∪ {0});
  if Z.sup ≠ maxSup then begin                 // see Property 5.2.2
    A_1 = {{Z[1]}, {Z[2]}, ..., {Z[k]}};       // create 1-antecedents
    /* Loop1 */
    for (i = 1; (A_i ≠ ∅) and (i < k); i++) do begin
      forall X ∈ A_i do begin
        find Y ∈ F_i such that Y = X;
        XCount = Y.count;
        /* Is X ⇒ Z\X an association rule? */
        if (Z.count / XCount > c) then begin
          /* Aren't there representatives longer than X ⇒ Z\X? */
          if (maxSup / XCount ≤ c) then        // see Property 5.2.1.i
            print(X, " ⇒ ", Z\X, " with support: ", Z.count,
                  " and confidence: ", Z.count / XCount);
          /* Antecedents of association rules are not extended */
          A_i = A_i \ {X};                     // see Property 5.2.1.ii
        endif;
      endfor;
      A_{i+1} = AprioriGen(A_i);               // compute (i+1)-antecedents
    endfor;
  endif;
endfor;
endproc;

The FastGenAllRepresentatives algorithm computes representative association
rules from each itemset in $F$. Let $Z$ be a considered itemset in $F$. Only $k$-rules, $k = |Z|$,
are generated from $Z$. First, maxSup is determined as the maximum of the supports of
those itemsets in $F_{k+1}$ which are supersets of $Z$. If there is no superset of $Z$ in $F_{k+1}$ then
maxSup = 0. Let us note that the supports of other proper supersets of $Z$, which do not
belong in $F_{k+1}$, are not greater than maxSup. Clearly, maxSup > s or maxSup = 0. If $sup(Z)$
is the same as maxSup then no representative rule can be generated from $Z$ (see
Property 5.2.2). Otherwise, single-item antecedents of candidate $k$-rules are created.
Loop1 starts. In general, the $i$-th iteration of Loop1 looks as follows:

Each candidate $X \Rightarrow Z \setminus X$, where $X \subset Z$ belongs in the $i$-itemsets $A_i$, is considered. $Z$ is
frequent, so $X$, which is a subset of $Z$, is also frequent. In order to check if $X \Rightarrow Z \setminus X$ is
an association rule, its confidence $sup(Z)/sup(X)$ has to be determined.
$sup(Z)$ = Z.count, while $sup(X)$ is computed as $sup(Y)$ of a frequent itemset $Y$ in $F_i$ such
that $Y = X$. Only association rules that satisfy both conditions of Property 5.2.1 are
representative. Condition (ii) is satisfied for any antecedent $X$ which is a 1-itemset.
Proper generation of antecedents makes this condition true also for the subsequent sets $A_i$.
So, in order to state whether an association rule is representative it is enough to check
if condition (i) of Property 5.2.1 holds, i.e. whether $maxSup \leq s$ or $maxSup/sup(X) \leq c$. If
maxSup = 0 then both subconditions are satisfied, so $X \Rightarrow Z \setminus X$ is representative.
Otherwise maxSup > s, which means that the latter subcondition will decide if $X \Rightarrow Z \setminus X$
is representative. The antecedent $X$ of each association rule $X \Rightarrow Z \setminus X$ is removed from
$A_i$. Having found all representative $k$-rules with $i$-antecedents from $Z$, $(i+1)$-itemset
antecedents $A_{i+1}$ are built from $A_i$ by the AprioriGen function. As a result, $A_{i+1}$ does
not contain any itemset $X$ such that $X \Rightarrow Z \setminus X$ would belong in the cover of another
association rule $X' \Rightarrow Z \setminus X'$ such that $X' \subset X$. Therefore statement (ii) of Property 5.2.1
is an invariant of the algorithm.
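The procedure can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names, the dictionary representation of $F$ (itemset to support count) and the simple `apriori_gen` join are assumptions made for the example, and the strict confidence test (> c) follows the reading of Property 5.2.1 used above.

```python
from itertools import combinations


def apriori_gen(itemsets):
    """Build (i+1)-itemsets all of whose i-subsets belong to `itemsets`."""
    itemsets = set(itemsets)
    if not itemsets:
        return set()
    size = len(next(iter(itemsets))) + 1
    candidates = {a | b for a in itemsets for b in itemsets if len(a | b) == size}
    return {cand for cand in candidates
            if all(frozenset(sub) in itemsets for sub in combinations(cand, size - 1))}


def fast_gen_all_representatives(support, c):
    """support: dict mapping every frequent itemset (frozenset) to its support.

    Returns a list of (antecedent, consequent, support, confidence) tuples.
    """
    rules = []
    for z, sup_z in support.items():
        k = len(z)
        # Largest support among frequent (k+1)-supersets of Z, or 0 if none exists.
        max_sup = max((sup for y, sup in support.items()
                       if len(y) == k + 1 and z < y), default=0)
        if sup_z == max_sup:          # Property 5.2.2: nothing representative from Z
            continue
        antecedents = {frozenset([item]) for item in z}   # 1-antecedents
        i = 1
        while antecedents and i < k:
            for x in list(antecedents):
                sup_x = support[x]
                if sup_z / sup_x > c:                 # X => Z\X is an association rule
                    if max_sup / sup_x <= c:          # Property 5.2.1.i
                        rules.append((x, z - x, sup_z, sup_z / sup_x))
                    antecedents.discard(x)            # Property 5.2.1.ii
            antecedents = apriori_gen(antecedents)    # (i+1)-antecedents
            i += 1
    return rules


# Hypothetical support counts for the frequent itemsets over items a, b, c.
support = {frozenset("a"): 4, frozenset("b"): 3, frozenset("c"): 3,
           frozenset("ab"): 3, frozenset("ac"): 2, frozenset("bc"): 2,
           frozenset("abc"): 2}
for x, y, sup, conf in fast_gen_all_representatives(support, c=0.6):
    print(sorted(x), "=>", sorted(y), sup, round(conf, 2))
```

The support threshold does not appear as a parameter because, as in the procedure, every itemset in $F$ is assumed to be frequent already.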



6 Conclusions 

In the paper we investigated properties of association rules that allowed us to construct 
an efficient algorithm computing representative association rules. Unlike the algorithm 
proposed in [1], the new algorithm exploits solely the information about the supports 
of frequent itemsets. 
Abstract. We present a modification of the ProbRough algorithm
for inducing decision rules from data. The generated
rough classifiers are now sensitive to costs varying from object 
to object in the training data. The individual costs are repre- 
sented by new cost attributes defined for every single decision. 
In this approach the decision attribute is dispensable. Grouping 
of objects and defining prior probabilities are made on the ba- 
sis of the group attribute. Values of this attribute may have no 
relations with the decisions. The proposed approach is a gen- 
eralization of the methodology incorporating the cost matrix. 
Behavior of the algorithm is illustrated on the data concerning 
the credit evaluation task. 



1 Introduction 

In supervised machine learning a classification model is constructed inductively 
by exploration of a number of objects representing several classes and general- 
ization from these objects. The search process in a space of models (e.g. decision 
trees or rulesets) is usually directed by maximizing the classification accuracy, 
measured by the percentage of new test objects correctly classified. The use 
of the classification accuracy criterion tacitly assumes that the distribution of 
classes in the real world is the same as in the training data, and the costs of mis- 
classification are equal. This is rather a rare situation in real-life classification 
tasks (Provost and Fawcett, 1997). In such tasks it is often more appropriate
to reduce the cost of misclassified objects (Pazzani et al., 1994). The reason is
that it frequently costs more to make one kind of classification error than the
other. For example, in a credit evaluation task, it usually costs more to
determine that a credit applicant will pay his debts back when in reality he is
a defaulter than to establish that a client with a good credit record will default
on payments. Some learning algorithms offer support for classification
tasks with unequal misclassification costs. However, in some real-life
problems misclassification costs vary from object to object. Unfortunately,
this new challenge has had no satisfactory solution so far (Ezawa et al., 1996).
In this paper we present a modified version of the ProbRough algorithm
(Piasta and Lenarcik; 1996, 1998). The source of inspiration for the construction of
ProbRough was rough set theory (Pawlak, 1991). Rough classifiers generated
by the modified ProbRough algorithm are sensitive to costs varying from object
to object. The individual costs are represented by new cost attributes defined for 
every single decision. The decision attribute is not necessary in this approach. 
Grouping of objects and defining prior probabilities are made on the basis of 
the group attribute. The values of this attribute may have no relations with the 
decisions. 

Modification of the algorithm is primarily concerned with the way of computing
values of the cost criterion. The method of determining a set of intermediate
values for continuous attributes is also modified.

The proposed approach is more general than the methodology incorporating 
the cost matrix. Behavior of the algorithm is illustrated on the data concerning 
the credit evaluation task. 

2 Learning task 

In our paper we discuss the problem of inducing the decision rules of the condition- 
decision type from a given training sample of objects. The training sample is a 
subset of a given universe of objects. 

Our goal is to minimize costs of making a decision about every new object by 
using the induced ruleset. The condition parts of the rules contain information 
about the values of condition attributes, characterizing objects. These attributes 
are denoted by $X_1, X_2, \ldots, X_m$. We consider a finite set of $l$ possible decisions.
For each single decision, we define the cost attribute. The value of this attribute
for an object is equal to the cost of relating the decision with the object. The
cost attributes are denoted by $Y_1, Y_2, \ldots, Y_l$.

Every condition attribute has its own specific set of values that determines 
the type of the attribute: discrete unordered, discrete ordered, or continuous one. 
The cost attributes are number-valued. We denote elements of the value sets
for condition and cost attributes by small letters: $x_1, \ldots, x_m$ and $y_1, y_2, \ldots, y_l$,
respectively. A condition attribute space is a collection of all possible m-tuples
$(x_1, \ldots, x_m)$ of condition attribute values. Any object characterized by given
values of condition attributes has its unique position in the condition attribute 
space. 

The training data is a finite set of objects with known values of attributes. 
The information about these objects has a form of a table. There are two general 
approaches to collecting the training data. In the first approach the training
data is a representative sample of the whole universe, obtained, for example, by
random sampling. In the second approach we assume that the whole universe
is partitioned into disjoint groups. The true proportions of objects representing
these groups have to be known and given as the prior probabilities. In this case,
the training data is composed of learning objects that are taken at random only
within the groups. In this approach it is convenient to use a group attribute.
We denote this attribute by $G$, and its values by $g$. The second approach to
collecting training data is more general and therefore in the sequel we take into
consideration only this case. All formulae for the first case can be easily obtained
by substituting the prior probability of the group $g$ by the ratio

$$\frac{\text{number of learning objects related to } g}{\text{number of all learning objects}}.$$

Let us notice that we can assign a set of decisions to every learning object in 
a natural way by taking these decisions that correspond to the minimum cost. 
In this way we can obtain a multivalued substitute of the decision attribute 
which is typical for classification tasks. It is worth stressing that generally, in 
the proposed approach, the group attribute and the decision attribute can play 
quite different roles. 

The cost criterion presented in the next section refers to the estimates of the
probability that a new object from the universe belongs to different subsets of the
condition attribute space. For any subset $\Delta$ of the condition attribute space, we
denote the event that an object taken at random from the universe belongs to $\Delta$
by $X \in \Delta$. Similarly, for any group $g$, $G = g$ is the event that an object taken at
random from the universe comes from the group $g$. We denote the probabilities
of the above events by $P(X \in \Delta)$ and $P(G = g)$, respectively. We use the symbol
$P(X \in \Delta, G = g)$ to represent the probability that both events occur
simultaneously. To get the conditional probability $P(X \in \Delta \mid G = g)$ that event
$X \in \Delta$ occurs given that event $G = g$ occurs, we divide the probability that
both events occur by the probability that the second event occurs, i.e.

$$P(X \in \Delta \mid G = g) = P(X \in \Delta, G = g) / P(G = g). \qquad (1)$$

Let us notice that the probability $P(G = g)$ is equal to the prior probability
corresponding to the group $g$. Since the learning objects are taken at random
within the groups, the conditional probability $P(X \in \Delta \mid G = g)$ can be estimated
by the ratio

$$\frac{\text{number of learning objects being in } \Delta \text{ that come from the group } g}{\text{number of all learning objects from the group } g}. \qquad (2)$$

The probability $P(X \in \Delta, G = g)$ is estimated by the product of the ratio (2) and
$P(G = g)$.

The learning algorithm tries to find an optimum partition of the condition
attribute space by minimizing the criterion value. The resulting partition can
be described in a compact way by a set of simple decision rules, if we allow
partition elements of a special form only. We need some additional notions
to define the form of partition elements. By a segment $\Delta_q$ in the value set of
the attribute $X_q$ we mean an interval when the values of the attribute $X_q$ are
ordered, or an arbitrary set otherwise. By a feasible subset $\Delta$ in the condition
attribute space, determined by segments $\Delta_1, \Delta_2, \ldots, \Delta_m$ corresponding to the
condition attributes, we mean the collection of all m-tuples $(x_1, x_2, \ldots, x_m)$ such that
$x_1 \in \Delta_1, x_2 \in \Delta_2, \ldots, x_m \in \Delta_m$.
An assignment of the set of decisions $\pi(\Delta)$ to a feasible subset $\Delta$ can be
written in the form of the decision rule:

$$\text{if } x_1 \in \Delta_1 \text{ and } \ldots \text{ and } x_m \in \Delta_m \text{ then } d \in \pi(\Delta). \qquad (3)$$

The subset $\Delta$ is the domain of the rule, while $\pi(\Delta)$ is the set of decisions that
the rule assigns to the objects from $\Delta$. When the set $\Delta_q$ is the whole value set
of the condition attribute $X_q$, then this attribute may be omitted in the rule.
In practice, the set of decisions $\pi(\Delta)$ assigned to the element $\Delta$ of the partition
usually contains a single decision. We assign several equivalent decisions to the
same $\Delta$, when we are not able to distinguish them with respect to the given
criterion.

By a rough classifier we mean every set of the decision rules (3) with the 
domains that form a partition of the condition attribute space into disjoint 
subsets. Each rough classifier assigns the unique set of decisions to every element 
of the condition attribute space. When the assignments obtained with different 
rough classifiers are identical then these classifiers are equivalent. 

Finally, we describe the convention of decision-making by using the rule-set
of the rough classifier. In order to classify an object $(x_1, x_2, \ldots, x_m)$ with
an unknown value of the decision attribute it is sufficient to find the domain
$\Delta$ which contains $(x_1, x_2, \ldots, x_m)$ and assign the set of decisions $\pi(\Delta)$ to the
object. When $\pi(\Delta)$ contains a single decision, then the assignment is unique
and the corresponding rule is decisive. Otherwise, the rule is not decisive. In
this latter case any decision from $\pi(\Delta)$, chosen at random with the probability
$1/\mathrm{card}\,\pi(\Delta)$, can be assigned to the object.
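The decision-making convention can be sketched as below. This is only an illustration under stated assumptions: the rule representation (a dictionary of segments plus a set of decisions), the half-open interval convention and the sample rules are hypothetical and are not taken from the paper.

```python
import random


def in_segment(value, segment):
    """A segment is an interval (low, high] for ordered attributes (assumed
    convention) or an arbitrary set of values for unordered ones."""
    if isinstance(segment, tuple):
        low, high = segment
        return low < value <= high
    return value in segment


def classify(obj, rules, rng=random):
    """Find the rule whose domain contains `obj` and return one of its decisions.

    rules: list of (segments, decisions); attributes missing from `segments`
    are unconstrained, i.e. their segment is the whole value set.
    """
    for segments, decisions in rules:
        if all(in_segment(obj[attr], seg) for attr, seg in segments.items()):
            # A non-decisive rule picks each decision with probability 1/card.
            return rng.choice(sorted(decisions))
    raise ValueError("the rule domains do not cover this object")


# Hypothetical rule set whose domains partition a two-attribute space.
INF = float("inf")
rules = [
    ({"account": {"YES"}}, {"refuse"}),
    ({"account": {"NO"}, "duration": (-INF, 17)}, {"grant"}),
    ({"account": {"NO"}, "duration": (17, INF)}, {"grant", "refuse"}),
]
print(classify({"account": "NO", "duration": 12}, rules))
```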

3 Cost criterion 

The procedure of searching for the optimal rough classifier has to be guided 
by a precise criterion which enables a comparison of classifiers. Below, we
present the criterion based on the estimation of the expected value of the cost
of decision-making concerning new objects. This expected value is denoted by
$E(cost)$ in the sequel. This value, for a fixed classifier, can be interpreted as the
average cost when we make a great number of decisions about new objects by 
using the classifier. 

Suppose that a rough classifier is given. Domains of the decision rules of this
classifier determine a partition of the condition attribute space into disjoint feasible
subsets. A set of decisions $\pi(\Delta)$ is associated with each domain $\Delta$. Let us fix a
domain $\Delta$ and a group $g$. We start computing the average cost of decision-making
with the assumption that an object taken at random from the universe belongs to
$\Delta$, and comes from the group $g$. Assume first that a decision $d$ is assigned to every
object being in $\Delta$. Then the average cost $E(\text{cost of assigning } d \mid X \in \Delta, G = g)$
can be estimated by the ratio

$$\frac{\text{sum of costs of assigning } d \text{ to objects in } \Delta \text{ that come from } g}{\text{number of objects in } \Delta \text{ that come from } g}. \qquad (4)$$
By using the well-known property of the conditional expected value we obtain

$$E(\text{cost of assigning } d \mid X \in \Delta) = \sum_{g} E(\text{cost of assigning } d \mid X \in \Delta, G = g)\, P(G = g \mid X \in \Delta)$$

$$= \frac{1}{P(X \in \Delta)} \sum_{g} E(\text{cost of assigning } d \mid X \in \Delta, G = g)\, P(X \in \Delta, G = g),$$

where $\sum_{g}$ stands for summing across all the groups. Denoting the sum in the
last formula by $C(\Delta \to d)$, we obtain

$$E(\text{cost of assigning } d \mid X \in \Delta) = P(X \in \Delta)^{-1}\, C(\Delta \to d).$$

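Now, let us take all the decisions in $\pi(\Delta)$. According to our convention of
decision-making by use of the rough classifier, the conditional average cost
$E(cost \mid X \in \Delta)$ is equal to the mean value of the average costs
$E(\text{cost of assigning } d \mid X \in \Delta)$ corresponding to all $d$ from $\pi(\Delta)$. Thus,

$$E(cost \mid X \in \Delta) = \frac{1}{P(X \in \Delta)} \, \frac{1}{\mathrm{card}\,\pi(\Delta)} \sum_{d' \in \pi(\Delta)} C(\Delta \to d').$$

Finally, the average cost of decision-making concerning a new object by using
the rule-set of the rough classifier is

$$E(cost) = \sum_{\Delta} E(cost \mid X \in \Delta)\, P(X \in \Delta) = \sum_{\Delta} \frac{1}{\mathrm{card}\,\pi(\Delta)} \sum_{d' \in \pi(\Delta)} C(\Delta \to d'), \qquad (5)$$

where the symbol $\sum_{\Delta}$ stands for summing across all the rule domains of the
rough classifier.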

While generating a rough classifier, we try to find not only the optimum
partition of the condition attribute space, but also to assign the optimum set
of decisions to each partition element. For a fixed partition, if the probabilities
$P(X \in \Delta, G = g)$ were known then this assignment could be done in an optimal
way. To explain this, we introduce the notion of a set of admissible decisions for
$\Delta$. We define it as the set of those decisions that lead to the minimum value
of the cost $C(\Delta \to d)$, where $d$ runs over the set of all decisions. Clearly, for a
particular $\Delta$, the mean cost

$$\frac{1}{\mathrm{card}\,\pi(\Delta)} \sum_{d' \in \pi(\Delta)} C(\Delta \to d') \qquad (6)$$

in formula (5) takes its minimum value when $\pi(\Delta)$ is an arbitrary non-empty
subset of the set of admissible decisions for $\Delta$. In order to avoid
ambiguity, we take the whole set of admissible decisions as $\pi(\Delta)$. In this case
the mean (6) is equal to the minimum value of $C(\Delta \to d)$, where $d$ runs over
the set of all decisions.
Now, it is clear that the value of the cost criterion can be treated as a function
of a partition of the condition attribute space. The value of (5) can be rewritten in
the form

$$E(cost) = \sum_{\Delta} \min_{d} C(\Delta \to d), \qquad (7)$$

where the sum is taken across all the partition elements while the minimum is
taken across all the decisions. The value of the criterion is determined by the
partition of the condition attribute space, the probabilities $P(X \in \Delta, G = g)$,
and the costs $E(\text{cost of assigning } d \mid X \in \Delta, G = g)$. The value of the criterion
can be estimated by using the training data, prior probabilities, and estimators
(2) and (4).
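As a rough illustration of how criterion (7) can be estimated, the sketch below combines estimators (2) and (4): their product for a partition element and a group $g$ reduces to $P(G = g)$ divided by the number of learning objects in $g$, times the sum of the costs of the objects from $g$ that fall into the element. The function name, the data layout and the small data set are assumptions made for this example only.

```python
from collections import defaultdict


def cost_criterion(objects, element_of, priors):
    """Estimate criterion (7).

    objects: list of (x, group, costs), where costs maps decision -> cost.
    element_of: function mapping x to the id of its partition element.
    priors: dict mapping group -> prior probability P(G = g).
    """
    group_size = defaultdict(int)
    for _, group, _ in objects:
        group_size[group] += 1

    # c[element][decision] accumulates the estimate of C(element -> decision).
    c = defaultdict(lambda: defaultdict(float))
    for x, group, costs in objects:
        weight = priors[group] / group_size[group]
        for decision, cost in costs.items():
            c[element_of(x)][decision] += weight * cost

    # Sum over partition elements of the minimum over decisions.
    return sum(min(per_decision.values()) for per_decision in c.values())


# Hypothetical learning objects: one condition attribute, two groups.
objects = [
    (1, "good", {"grant": -10, "refuse": 0}),
    (2, "good", {"grant": -20, "refuse": 0}),
    (3, "bad", {"grant": 50, "refuse": 0}),
    (4, "bad", {"grant": 100, "refuse": 0}),
]
priors = {"good": 0.7, "bad": 0.3}
print(cost_criterion(objects, lambda x: 0, priors))        # trivial partition
print(cost_criterion(objects, lambda x: x < 2.5, priors))  # split at 2.5
```

Partition elements that contain no learning objects contribute nothing to the sum, which matches a zero estimate of their probability.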

4 General idea of the ProbRough algorithm 

The modification of the ProbRough algorithm, presented in this paper, concerns 
mainly the representation of data (inclusion of new cost attributes), and the 
way of estimating values of the criterion that directs the searching process in 
the space of models. The main ideas and structure of the algorithm described in 
(Piasta and Lenarcik; 1996, 1998) are preserved. 

ProbRough consists of two main phases. In the first phase, the condition 
attribute space is partitioned into feasible subsets in the iteration process that 
is guided by the cost criterion (7). Every partition in this phase is determined 
by a division of the value set of a particular condition attribute. During this 
process the value of the cost criterion can either decrease or remain unchanged. 
In the basic version of ProbRough the number of iterations in the first phase of 
the algorithm is given in advance. The number of iterations can be optimized 
by using the k-fold cross validation method (Piasta and Lenarcik, 1996). The
resulting classifiers are generated in the second phase of the algorithm by joining 
the elements of partitions obtained in the first phase. Joining the feasible subsets 
is permitted when they have a common admissible decision. The value of the 
cost criterion is preserved in this phase of the algorithm. 

Illustrative example. 

Now, we illustrate the ProbRough algorithm incorporating costs varying from 
object to object on synthetic data concerning a simplified problem from the area 
of evaluating credit applications. Each object is characterized by two condition 
attributes. The first one, duration of credit in months, is treated as a continuous 
one. The second attribute, account, is a binary one, and takes the value
"yes" when the credit applicant has a bank account, or the value "no" otherwise.
We also consider two cost attributes associated with two decisions: granting a
credit (decision1), or refusal of an application (decision2). Table 1 includes the
hypothetical data. We assume that credits were granted to all clients. Thus, it
is possible to estimate the costs associated with decision1 (see Table 1).
Negative costs mean profits for the bank. We also assume that the refusal of a 
credit application does not bring any cost for the bank (attribute decision2 in 
Table 1). 
Table 1. The illustrative data set.

object  duration  account  decision1  decision2
1       10        YES        750      0
2       12        NO        -250      0
3       16        NO        -200      0
4       18        NO         250      0
5       24        YES       -300      0
6       24        YES       2250      0
7       36        NO         500      0
8       42        NO        -350      0
9       48        YES       2000      0
10      48        NO        -200      0



We use the data from Table 1 as the learning set to explain the process of
rough classifier generation with the number of iterations fixed in advance at
3. We assume prior probabilities equal to the frequencies of groups in the data.
In such a case the value of the cost criterion (7) for a partition of the condition
attribute space has a simple interpretation. For a fixed element of the partition
and a given decision, we obtain the cost related to that decision as the sum of
individual costs across all the objects belonging to the partition element. Next,
we repeat this step with all the decisions and choose the minimum value of the
cost. The above procedure is performed with every partition element. Finally,
we compute the value of the cost criterion as the ratio of the sum of the minimum
values of the costs across all the partition elements to the number of all learning objects.

We start the first phase of the algorithm with a trivial partition of the condition
attribute space. The only element of this partition consists of the whole
space. In order to find the criterion value for this partition, we obtain the sums of
costs related to making decision1 and decision2 as equal to 4450 and 0,
respectively. Hence, the cost value, as min(4450, 0)/10, is 0. In the first iteration
of the algorithm, we choose the partition of the condition attribute space that
yields the minimum value of the criterion. This is the partition induced by the
segmentation {NO} - {YES} of the value set of the condition attribute account.
The cost value is equal to -25. In the next iteration, we choose a segmentation
which, combined with the previous one, yields the minimum value of the cost.
This segmentation is determined by the intermediate value 39 of the condition
attribute duration. The resulting partition of the condition attribute space consists
of four elements, and the criterion value corresponding to this partition is
equal to -55. In the third iteration we choose, in the same way, the segmentation
determined by the intermediate value 17 of duration. The outcome partition
consists of six elements and the corresponding cost is -100. In the second phase
of the algorithm we join three elements of the outcome partition with a common
admissible decision decision2 into one feasible subset. Finally, we obtain
a unique rough classifier illustrated in Figure 1. The induced classifier can be
presented as the set of four decision rules:
1. if account = YES then decision2,
2. if 17 < duration < 39 and account = NO then decision2,
3. if duration > 39 and account = NO then decision1,
4. if duration < 17 and account = NO then decision1.

Fig. 1. Partition of the condition attribute space induced by the unique rough
classifier obtained from the illustrative data set.
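The criterion values quoted in the example can be retraced with the following sketch of a greedy first phase. It is an illustration only: the function names are hypothetical, and taking the intermediate values of duration as midpoints between adjacent observed values is an assumption (it is consistent with the values 39 and 17 above, but the paper does not spell out the procedure here).

```python
# (duration, account, cost of decision1, cost of decision2), copied from Table 1.
DATA = [
    (10, "YES", 750, 0), (12, "NO", -250, 0), (16, "NO", -200, 0),
    (18, "NO", 250, 0), (24, "YES", -300, 0), (24, "YES", 2250, 0),
    (36, "NO", 500, 0), (42, "NO", -350, 0), (48, "YES", 2000, 0),
    (48, "NO", -200, 0),
]


def criterion(cuts, split_account):
    """Criterion (7) when the priors equal the group frequencies: the sum over
    partition elements of the minimum total cost, divided by the number of objects."""
    totals = {}
    for duration, account, cost1, cost2 in DATA:
        key = (sum(duration > t for t in cuts), account if split_account else None)
        sum1, sum2 = totals.get(key, (0, 0))
        totals[key] = (sum1 + cost1, sum2 + cost2)
    return sum(min(pair) for pair in totals.values()) / len(DATA)


def first_phase(iterations):
    """Greedily add the segmentation that yields the lowest criterion value."""
    durations = sorted({row[0] for row in DATA})
    midpoints = [(a + b) / 2 for a, b in zip(durations, durations[1:])]
    cuts, split_account, history = [], False, []
    for _ in range(iterations):
        candidates = [(criterion(cuts + [t], split_account), ("duration", t))
                      for t in midpoints if t not in cuts]
        if not split_account:
            candidates.append((criterion(cuts, True), ("account", None)))
        value, choice = min(candidates, key=lambda cand: cand[0])
        if choice[0] == "account":
            split_account = True
        else:
            cuts.append(choice[1])
        history.append((choice, value))
    return history


print(criterion([], False))   # trivial partition
for choice, value in first_phase(3):
    print(choice, value)
```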



5 Behavior of the algorithm on real-life data 

In this section we illustrate the application of our algorithm on the German credit 
data set that was provided to the repository of machine learning databases at 
the University of California by H. Hofmann from the University of Hamburg. 
Credit applicants (objects) are described by the condition attributes of mixed 
types: from unordered qualitative ones, e.g., marital status, job, reason for loan 
request, to continuous quantitative ones, e.g., age, credit amount, duration of 
current account. Each object belongs to one of the two groups: good or bad credit 
applicants. We consider two decisions: granting a credit (decision1) and refusal
of a credit application (decision2).

The learning set consists of 1000 objects. We assumed the priors $P(G = good) = 0.7$
and $P(G = bad) = 0.3$, which reflect the structure of the learning set.
Since the representation of data in the German credit set does not fully fit the 
representation used in our approach, values of the cost attributes were based on 
the amount of credit attribute. We assumed that while granting a credit, the 
bank profit is 5% of the credit value in the case of a good client, and the bank 
loss is 25% of the credit value in the case of a bad client. 
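The construction of the cost attributes described above can be sketched as follows. The function and key names are hypothetical, and the zero cost of refusal is an assumption carried over from the illustrative example, since this section does not state the cost of decision2 explicitly.

```python
def cost_attributes(credit_amount, group, profit_rate=0.05, loss_rate=0.25):
    """Return the two cost attributes of one credit applicant.

    Negative costs are profits for the bank. The zero cost for decision2
    (refusal) is an assumption borrowed from the illustrative example.
    """
    if group == "good":
        grant_cost = -profit_rate * credit_amount   # bank earns 5% of the credit
    elif group == "bad":
        grant_cost = loss_rate * credit_amount      # bank loses 25% of the credit
    else:
        raise ValueError("group must be 'good' or 'bad'")
    return {"decision1": grant_cost, "decision2": 0.0}


print(cost_attributes(2000, "good"))
print(cost_attributes(2000, "bad"))
```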

Using ProbRough with the number of iterations equal to 3 we have obtained
a number of classifiers corresponding to the same value of the cost criterion,
-102.7. One of these classifiers is of the form:

1. if $a1 \in \{1, 2, 3\}$ then decision2
2. if $a1 = 4$ and $a5 > 3865.5$ and $a7 \in \{1, 2, 3\}$ then decision2
3. if $a1 = 4$ and $a7 \in \{4, 5\}$ then decision1
4. if $a1 = 4$ and $a5 < 3865.5$ and $a7 \in \{1, 2, 3\}$ then decision1

By comparison, using ProbRough in the version presented in (Piasta and
Lenarcik, 1996), with the same number of iterations, we have obtained a higher
value of the cost criterion, equal to -65.0.

6 Concluding remarks 

The main contribution of our paper lies in modification of the ProbRough al- 
gorithm that enables us to incorporate individual misclassification costs varying 
from object to object. The modification concerns mainly the inclusion of new 
cost attributes to the representation of data, and the way of estimating the cost 
criterion that directs the search process in the space of models. The prior proba- 
bilities related to the groups of objects can also be incorporated. The algorithm 
makes room for non-determinism in assignment of decisions to partition elements 
of the underlying attribute space, which is important when several decisions with 
very close values of the cost compete for assignment. 
Abstract. This paper describes two soft techniques, GDT-NN and GDT-RS,
for mining if-then rules in databases with uncertainty and incompleteness.
The techniques are based on a Generalization Distribution
Table (GDT), in which the probabilistic relationships between concepts
and instances over discrete domains are represented. The GDT provides a
probabilistic basis for evaluating the strength of a rule. We describe how a
GDT can be represented by connectionist networks (GDT-NN for short),
and how if-then rules can be discovered by learning on the GDT-NN. Furthermore,
we combine the GDT with the rough set methodology (GDT-RS
for short). Thus, we can first find the rules with larger strengths among the
possible rules, and then find minimal relative reducts from the set of rules
with larger strengths. The strength of a rule represents the uncertainty
of the rule, which is influenced by both unseen instances and noise. We
compare GDT-NN with GDT-RS, and describe why GDT-RS is a better way
than GDT-NN for large, complex databases.



1 Introduction 

Over the last two decades, several inductive methods for learning if-then rules 
and concepts from instances have been proposed. Based on the viewpoint of the 
style of information processing, the inductive methods can be divided into two 
styles: top-down and bottom-up. Usually, the methods belonging to top-down 
style such as ID3 [6] can learn rules very fast, but it is difficult to handle data 
change, to use background knowledge in the learning process, and to perform 
in a parallel-distributed cooperative mode. On the other hand, the methods be- 
longing to bottom-up style such as version-space [1] and back-propagation [2] are 
incremental ones, in which learning a concept is possible not only when instances 
are input simultaneously but also when they are given one by one. Although the
methods belonging to the bottom-up style do not have the problems that the methods
belonging to the top-down style have, some issues in real-world applications such as

— How can rules be learned in the environment with noise and incompleteness? 

— How can unseen instances be predicted, and how can the uncertainty of a 
rule including the prediction be represented explicitly? 

— How can biases be selected and altered dynamically for constraint and search 
control? 

— How can the use of background knowledge be selected according to whether 
background knowledge exists or not? 



L. Polkowski and A. Skowron (Eds.): RSCTC’98, LNAI 1424, pp. 231—238, 1998. 
(c) Springer- Verlag Berlin Heidelberg 1998 
are still the ones to which no satisfactory solution has been found.

In this paper, we describe two soft techniques, GDT-NN and GDT-RS, for
mining if-then rules in databases with uncertainty and incompleteness. The
techniques are based on a Generalization Distribution Table (GDT), in which
the probabilistic relationships between concepts and instances over discrete
domains are represented. The GDT provides a probabilistic basis for evaluating
the strength of a rule. We describe how a GDT can be represented by
connectionist networks (GDT-NN for short), and how if-then rules can be discovered by
learning on the GDT-NN. Furthermore, we combine the GDT with the rough set
methodology (GDT-RS for short). Thus, we can first find the rules with larger
strengths among the possible rules, and then find minimal relative reducts from the set
of rules with larger strengths. The strength of a rule represents the uncertainty
of the rule, which is influenced by both unseen instances and noise. We compare
GDT-NN with GDT-RS, and describe why GDT-RS is a better way than GDT-NN
for large, complex databases.



2 GDT-NN 



The central idea of our methodology is to use a Generalization Distribution Table
(GDT) as a hypothesis search space for generalization, in which the probabilistic
relationships between concepts and instances over discrete domains are represented
[7,8]. A GDT consists of three components: the possible
instances, which are denoted in columns in a GDT, are all possible combinations
of attribute values in a database; the possible generalizations for instances, which
are denoted in rows in a GDT, are all possible generalizations for all possible
instances; the probabilistic relationships between the possible instances and the
possible generalizations, which are denoted in the elements $G_{ij}$ of a GDT,
are the probabilistic distribution describing the strength of the relationship
between every possible instance and every possible generalization. The default
prior probability distribution is equiprobable, that is,

$$p(PI_j \mid PG_i) = \begin{cases} \dfrac{1}{N_{PG_i}} & \text{if } PI_j \in PG_i \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where $PI_j$ is the $j$th possible instance, $PG_i$ is the $i$th possible generalization,
and $N_{PG_i}$ is the number of the possible instances satisfying the $i$th possible
generalization, that is,

$$N_{PG_i} = \prod_{j}^{m} n_j, \qquad (2)$$

where $j = 1, \ldots, m$, and $j \neq$ the attribute that is contained by the $i$th possible
generalization (i.e., $j$ just contains the attributes expressed by the wild card as
shown in Table 1).
Furthermore, background knowledge can be used as a bias to constrain the
possible instances and the prior probabilistic distributions. For example, if we
use the background knowledge

"when the air temperature is very high, it is not possible there exists
some frost at ground level",

then we do not consider the possible instances that contradict this
background knowledge among all possible combinations of different attribute values
in an earthquake database for creating a GDT. Thus, we can get more refined
rules by using background knowledge.



Table 1. Generalization Distribution Table for a sample database

(Only fragments of the table body survive: the column headings a0b0c0, a0b0c1, a0b1c0, a0b1c1, ... for the possible instances, and the row labels ..., a1b0*, a1b1* for the possible generalizations.)
Table 2. A sample database

No   a    b    c    d
u1   a0   b0   c1   y
u2   a0   b1   c1   y
u3   a0   b0   c1   y
u4   a1   b1   c0   n
u5   a0   b0   c1   n
u6   a0   b2   c1   n
u7   a1   b1   c1   y



In our approach, the basic process of hypothesis generation is that of generalizing
the instances observed in a database by searching and revising the GDT.
Table 1 shows a GDT that is generated by using three attributes, a, b, c, in the
sample database shown in Table 2. In Table 1, "*" is a wild card that means the
attribute can be any value, and the elements that are not displayed are all
zero.
A GDT can be represented by connectionist networks for rule discovery in an
evolutionary, parallel-distributed cooperative mode (GDT-NN for short) [8]. The 
connectionist networks consist of three layers: the input unit layer, the hidden 
unit layer, and the output unit layer. 

A unit that receives instances from a database is called an input unit. A unit
that receives a result of learning in a hidden unit, which is used as one of the
rule candidates discovered, is called an output unit. A unit that is neither an input
nor an output unit is called a hidden unit. The hidden unit layer is further
divided into stimulus units and association units. The stimulus units
represent the possible instances, like the columns in a GDT, and the association
units represent the possible generalizations for instances, like the
rows in a GDT. There is a link between the stimulus units and
an association unit if the association unit represents a possible generalization
for some possible instances in the stimulus units. The probabilistic
relationships between the possible instances and the possible generalizations are
denoted by the weights of the links, and the initial weights are equiprobable, like
the Gij of an initial GDT, if we do not use any prior background knowledge for
creating the initial weights. There are two kinds of links, excitatory
and inhibitory, which can be changed dynamically.

We have developed an algorithm (called GDT-NN) for learning if-then rules
based on the connectionist representation [8]. One good feature of GDT-NN
is that every instance in a database is searched only once, and if the data in a
database are changed (added, deleted, or updated), we only need to modify
the connectionist networks and the discovered rules related to the changed data;
the database is not searched again. Here we would like to stress that the
connectionist networks do not need to be explicitly created in advance. They can
be embodied in the learning algorithm, and we only need to record the weights
and the units stimulated by instances in a database. However, the number of
recorded items is still quite large when processing large, complex databases. We need
to find a much better way to solve this problem.



3 GDT-RS 



In order to solve the problem stated above, we combine the GDT with the
rough set methodology (GDT-RS for short). Using rough set theory as a
methodology of rule discovery is effective in practice [5,4]. The discovery process
based on the rough set methodology is that of knowledge reduction in such a way
that the specified decision can be made by using a minimal set of conditions.
The process of knowledge reduction is similar to the process of generalization in 
a hypothesis search space. By combining the GDT with rough sets, we can first 
find the rules with larger strengths from possible rules, and then find minimal 
relative reducts from the set of rules with larger strengths. Thus, a minimal 
set of rules with larger strengths can be acquired from databases with noisy, 
incomplete data by using GDT-RS. 
3.1 Rule Representation and Condition/Decision Attributes

Let $T = (U, A, C, D)$ be a decision table, $U$ a universe of discourse, $A$ a family
of equivalence relations over $U$, and $C, D \subset A$ two subsets of attributes that
are called condition and decision attributes, respectively. The learned rules are
typically expressed in the form

$$X \rightarrow Y \text{ with } S \qquad (X \in C,\ Y \in D),$$

that is, "a rule $X \rightarrow Y$ has a strength $S$ in a given decision table $T$", where $X$
denotes the conjunction of the conditions that a concept must satisfy, $Y$ denotes
a concept that the rule describes, and $S$ is a "measure of strength" with which the
rule holds.



3.2 Rule Strength 

We define the strength $S$ of a rule $X \rightarrow Y$ in a given decision table $T$ as follows:

$$S(X \rightarrow Y) = s(X) \times (1 - r(X \rightarrow Y)). \qquad (3)$$

From Eq. (3) we can see that the strength $S$ of a rule is affected by the following
two factors:

1. The strength of the generalization $X$ (i.e., the condition of the rule), $s$. It is
given by Eq. (4).

$$s(PG_i) = \sum_{j} p(PI_j \mid PG_i) = \frac{N_{\text{ins-rel},i}}{N_{PG_i}}, \qquad (4)$$

where the sum runs over the observed instances and $N_{\text{ins-rel},i}$ is the number
of the observed instances satisfying the ith generalization. The initial value of
$s(PG_i)$ is 0. The value is updated dynamically as the inputs are given one by one.
If all of the instances satisfying the ith generalization appear, the strength
takes its maximal value, 1. The larger the value of $s(PG_i)$, the stronger the ith
generalization.

2. The rate of noise, $r$. It shows the quality of classification, that is, how many
of the instances satisfying the conditions of a rule can be classified into
some class.

$$r(X \rightarrow Y) = \frac{N_{\text{ins-rel}}(X) - N_{\text{ins-class}}(X, Y)}{N_{\text{ins-rel}}(X)}, \qquad (5)$$

where $N_{\text{ins-rel}}(X)$ is the number of the observed instances satisfying the
generalization $X$, and $N_{\text{ins-class}}(X, Y)$ is the number of the instances belonging
to the class $Y$ within the instances satisfying the generalization $X$.



From the GDT, we can see that a generalization is 100% true if and only
if all of the instances belonging to this generalization appear. Let us again use the
example shown in Table 2. Considering the generalization {a0b1}, if both instances
{a0b1c0} and {a0b1c1} appear, the strength s({a0b1}) is 1; if only one
of {a0b1c0} and {a0b1c1} appears, the strength s({a0b1}) is 0.5, as shown in
Figure 1. We can see that both {a0b1} and {b1c1} are generalizations of the
instance {a0b1c1}, but their strengths are s({a0b1}) = 0.5 and s({b1c1}) =
1, respectively. No matter what the value of the noise rate r may be, the strength of the
rule {b1c1} → y is greater than the strength of the rule {a0b1} → y.

[Figure 1: the instances a0b1c0 and a0b1c1 covered by the generalization {a0b1}.]

Fig. 1. Probability of a generalization rule

If a generalization contains instances belonging to different decision classes,
the rule acquired from the generalization is noisy. Furthermore, when the
noise rate is over the threshold, the rule is contradictory. In the example of
Table 2, the generalization {a1b1} is such a contradictory generalization: its
strength is s({a1b1}) = 1 and the noise rates for the decision classes y and n
are both 0.5, so that S(a1 ∧ b1 → y) = S(a1 ∧ b1 → n) = 0.5. A user can specify an
allowed noise rate as the threshold value. Rules whose noise rates are larger
than the threshold value will then be deleted.
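The quantities in Eqs. (3) to (5) can be checked against Table 2 with a short Python sketch. Several points are assumptions of ours, not statements of the original method: the wild card is written "*", the attribute domains have 2, 3, and 2 values, s counts distinct observed instances, and r counts database records.

```python
# Table 2 as (a, b, c, d) records.
RECORDS = [
    ("a0", "b0", "c1", "y"),  # u1
    ("a0", "b1", "c1", "y"),  # u2
    ("a0", "b0", "c1", "y"),  # u3
    ("a1", "b1", "c0", "n"),  # u4
    ("a0", "b0", "c1", "n"),  # u5
    ("a0", "b2", "c1", "n"),  # u6
    ("a1", "b1", "c1", "y"),  # u7
]
DOMAIN_SIZES = (2, 3, 2)  # assumed number of values of a, b, c
WILD = "*"                # assumed spelling of the wild card


def matches(values, generalization):
    """True if the condition values satisfy the generalization."""
    return all(g == WILD or g == v for v, g in zip(values, generalization))


def generalization_strength(gen):
    """Eq. (4): distinct observed instances covered, divided by N_PG."""
    n_pg = 1
    for g, size in zip(gen, DOMAIN_SIZES):
        if g == WILD:
            n_pg *= size
    observed = {r[:3] for r in RECORDS if matches(r[:3], gen)}
    return len(observed) / n_pg


def noise_rate(gen, decision):
    """Eq. (5): share of covered records that fall outside the class."""
    covered = [r for r in RECORDS if matches(r[:3], gen)]
    if not covered:
        raise ValueError("no observed instance satisfies the generalization")
    in_class = sum(1 for r in covered if r[3] == decision)
    return (len(covered) - in_class) / len(covered)


def rule_strength(gen, decision):
    """Eq. (3): S = s(X) * (1 - r(X -> Y))."""
    return generalization_strength(gen) * (1 - noise_rate(gen, decision))
```

With these definitions, `rule_strength(("*", "b1", "c1"), "y")` gives 1.0 and `rule_strength(("a0", "b1", "*"), "y")` gives 0.5, in line with the comparison made in the text.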



3.3 Simplifying a Decision Table by Using the GDT 

By using the GDT, it is clear that one instance can be expressed by several
possible generalizations, and several instances can also be expressed by one
possible generalization. Simplifying a decision table means finding a minimal
set of generalizations that covers all of the instances in the decision table.

The method of computing the reducts of condition attributes in our approach, 
in principle, is equivalent to the discernibility matrix method [3,5], but we do 
not remove dispensable attributes. This is because 

— The greater the number of dispensable attributes, the more difficult it is to 
acquire the best solution; 

— Some values of a dispensable attribute may be indispensable for some values 
of a decision attribute. 

Figure 2.(1) gives the relationship among generalizations. We can see that
every generalization in an upper level contains all generalizations related to it in
lower levels. That is, {a0} ⊃ {a0c1} ⊃ {a0b1c1}. In other words, {a0}
can be specialized into {a0b1} and {a0c1} only. In contrast, {a0b1} and {a0c1}
can be generalized into {a0}. If the rule {a0} → y is true, the rules {a0b1} → y
and {a0c1} → y are also true.

It is clear that if a generalization for some instances is contradictory, the
related generalizations in the levels above it are also contradictory.
As shown in Figure 2.(2), {a0c1} is a contradictory generalization for
the instance {a0b1c1}, so the generalizations {a0} and {c1} are also contradictory.
Hence, for the instance {a0b1c1}, the generalizations {a0c1}, {b1}, {a0},
and {c1} are contradictory. Thus, only the generalizations {a0b1} and {b1c1} can
be used.

[Figure 2: (1) the relationship among generalizations; (2) the generalizations of {a0b1c1}. The legend distinguishes generalizations with instances in the same class from generalizations with instances in different classes.]

Fig. 2. The relationships among generalizations
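The classification of the generalizations of {a0b1c1} can be reproduced from Table 2 with the following Python sketch. It covers only the noise-free case, in which a generalization counts as contradictory as soon as it covers records from more than one decision class. The wild-card spelling "*" and all function names are illustrative assumptions.

```python
from itertools import combinations

# Table 2 as (a, b, c, d) records.
RECORDS = [
    ("a0", "b0", "c1", "y"), ("a0", "b1", "c1", "y"), ("a0", "b0", "c1", "y"),
    ("a1", "b1", "c0", "n"), ("a0", "b0", "c1", "n"), ("a0", "b2", "c1", "n"),
    ("a1", "b1", "c1", "y"),
]
WILD = "*"  # assumed spelling of the wild card


def generalizations_of(instance):
    """Proper generalizations of an instance that keep at least one value."""
    n = len(instance)
    result = []
    for kept_count in range(1, n):
        for kept in combinations(range(n), kept_count):
            result.append(tuple(instance[i] if i in kept else WILD
                                for i in range(n)))
    return result


def classes_covered(gen):
    """Decision classes of all records that satisfy the generalization."""
    return {r[3] for r in RECORDS
            if all(g == WILD or g == v for v, g in zip(r[:3], gen))}


def split_generalizations(instance):
    """Separate consistent from contradictory generalizations (no noise)."""
    consistent, contradictory = [], []
    for gen in generalizations_of(instance):
        if len(classes_covered(gen)) == 1:
            consistent.append(gen)
        else:
            contradictory.append(gen)
    return consistent, contradictory
```

Calling `split_generalizations(("a0", "b1", "c1"))` leaves only the patterns for {a0b1} and {b1c1} in the consistent list, which matches the conclusion drawn from Figure 2.(2).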

This result is the same as that of the discernibility matrix method when
no noise exists in the database [3,5]. Let $G_-$ be the contradictory generalizations and $G_p$
be all possible consistent generalizations obtained from a discernibility matrix.
Clearly, $G_p = \overline{G_-}$. That is,

$$G_p = \{b_1\} \cap (\{a_0\} \cup \{c_1\}) = \{b_1 a_0\} \cup \{b_1 c_1\},$$

$$\overline{G_-} = \overline{\{a_0 c_1\} \cup \{b_1\}} = \overline{\{a_0 c_1\}} \cap \overline{\{b_1\}} = \{b_1\} \cap (\{a_0\} \cup \{c_1\}) = \{b_1 a_0\} \cup \{b_1 c_1\}.$$

For a database with noise, each generalization that contains instances of
different classes should be checked. If a generalization contains more instances
belonging to one class than to the other classes, and the noise rate
is smaller than a threshold value, the generalization is regarded as a consistent
generalization of that class. Otherwise, the generalization is contradictory.
Furthermore, if two generalizations in the same level have different strengths, the
one with the larger strength will be selected first.

3.4 Rule Selection 

There are several possible ways for rule selection. For example, 

— Select the rules that contain as many instances as possible; 

— Select the rules in the levels as high as possible according to the first type 
of biases stated above; 

— Select the rules with larger strengths. 

Here we would like to describe a method of rule selection for our purpose as 
follows: 

— Since our purpose is to simplify the decision table, the rules that cover fewer
instances will be deleted if a rule that covers more instances exists.

— Since we prefer simpler results of generalization (i.e., more general rules), we
first consider the rules corresponding to an upper level of generalization.

— The rules with larger strengths are selected first as the real rules.

4 Conclusions 

In this paper, we presented two soft techniques, GDT-NN and GDT-RS, for
mining if-then rules in databases with uncertainty and incompleteness. We described
the basic concepts and principles of our methodology. Several databases,
such as postoperative patient, earthquake, weather, mushroom, and cancer, have
been tested or are being tested with our approaches. Although both GDT-NN
and GDT-RS are very soft techniques for data mining, GDT-RS is better
than GDT-NN for large, complex databases. By using GDT-RS, we can first
find the rules with larger strengths from possible rules, and then find minimal
relative reducts from the set of rules with larger strengths. Thus, a minimal
set of rules with larger strengths can be acquired from databases with noisy,
incomplete data.
Abstract. Institutional databases can be instrumental in understanding
a business process, but additional data may broaden
the empirical perspective on the investigated process. We present 
a few data mining principles by which a business process can 
be analyzed and the results represented. Sequential and par- 
allel process decomposition can apply in a data driven way, 
guided by a combination of automated discovery and human 
judgment. Repeatedly, human operators formulate open ques- 
tions, use queries to prepare the data, issue quests to invoke 
automated search, and interpret the discovered knowledge. As 
an example we use mining for knowledge about student enroll- 
ment, which is an essential part of the university educational 
process. The target of discovery has been the understanding of 
the university enrollment. Many discoveries have been made. 
The particularly surprising findings have been presented to the 
university administrators and affected the institutional policies. 



1 Business process analysis 

Many databases have been developed to store detailed information about business 
processes. By design, the data capture the key information about events that add up to 
the entire process. For instance, university databases keep track of student enrollment, 
grades, financial aid and other key information recorded each semester or each year. 

KDD can facilitate process understanding. In addition to the known elements 
of the process, which have been used in database design, further knowledge can be 
discovered by empirical analysis. We can use a discovery mechanism to mine a database 
in search of knowledge useful in postulating a particularly justified hidden structure 
within the process. For instance, it may turn out that different groups of students 
finish their degrees in different proportion or take drastically different numbers of 
credit hours. 

In this paper we focus on knowledge derived from data, in distinction to expert 
knowledge. We present a discovery process that results in knowledge which aids the 
business process understanding. The discovery process is driven by two factors. The first 
is the basic structure of the business process and the way it is represented by database 
schemas and attributes. It is used to plan data preparation and search problems. The 
second factor is the data and empirical knowledge they can provide. The knowledge 
discovered from data can lead to further data preparation and search for knowledge. 

Quests generate knowledge while queries prepare data. Automated search for
knowledge can use discovery systems such as EXPLORA (Klösgen, 1992), KDW (Piatetsky-Shapiro
and Matheus, 1991), 49er (Zytkow & Zembowicz, 1993), KDD-R (Ziarko & Shan,
1994), and Rosetta (Ohrn, Komorowski, Skowron, Synak, 1998). A discovery system requires
a well defined search problem, which we call a quest. It also requires data, which are
defined by a query. Queries can be supported by a DBMS, but when data come from
several databases, data miners must use their own application programs. An extended
KDD process can be described by a sequence of quests and queries.

Database design knowledge includes temporal relations between attributes.

When data describe a business process, the temporal order of events captured by
different attributes is clear most of the time. Since an effect cannot precede its cause,
questions about possible causal relations are constrained by the temporal relation
between attributes. Much of the sophisticated search for causes, performed by systems
such as TETRAD (Spirtes, Glymour & Scheines, 1993), is not needed. A typical question
about causes that can still be asked is "which among the temporally prior attributes
influences a given set of target attributes?"

Vital characteristics of a process include throughput, output and duration.

We do not need to argue that the output and/or throughput are the most important
effects of each business process. But it is also important to know how long the process
is active. This is needed to develop process effectiveness metrics and also to decide
how long the data should be kept in active records. The quest for knowledge about
process output, throughput and duration will guide our data mining effort. In practical
applications, a business process consists of many elementary processes added together.
For instance, the university "production of credit hours" is the sum of the enrollment
histories of individual students. For a given cohort of students, the total throughput
can be described by the histogram of the attribute "credit hours".

A process can be split into sequential and parallel components. Each of the
parallel subprocesses uses a part of the input and contributes a part of the output or
throughput. The inputs and outputs of parallel subprocesses add up to the input and
output of the entire process.

Process P can be decomposed sequentially into subprocess P1 followed by P2, when
the output of process P1 is the input to process P2.

Subprocesses can be further decomposed in sequence or in parallel. We can also 
seek explanation of the input to the business process P by processes that are prior to 
P and supply parts of P’s input. The data come from various sources, external to the 
business process database. 

Parallel decomposition can be guided by regularities between input and
output/throughput attributes. Let $C$ be an attribute which describes the throughput
of process P. Let $V_C$ be the range of values of $C$. The histogram of $C$ is the mapping
$h : V_C \rightarrow \mathbb{N}$, where for each $c \in V_C$, $h(c) = n$ is the number of occurrences of $c$ in the
data. We can use the histogram of $C$ to measure the efficiency of the process P.

Consider a regularity in the form of a contingency table for an attribute $A$ that
describes the input of P, and the throughput $C$. For each $a \in V_A$ and each $c \in V_C$,
$p(a, c)$ is the probability of the value combination $(a, c)$ derived from data. Using
the distribution of inputs given by the histogram $h(a), a \in V_A$, we can compute the
throughput histogram by converting the probabilities $p(a, c)$ into $p(c \mid a)$ and using:

$$h(c) = \sum_{a \in V_A} h(a)\, p(c \mid a), \quad \forall c \in V_C. \qquad (1)$$

When the histograms of $A$ and $C$ and the contingency table have been derived from
the same data, equation (1) is an identity. But the probability distributions $p(c \mid a)$
for each $a$ and equation (1) can provide valuable predictions for a new similar process,
when only the histogram of the input $A$ is known.
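The prediction step can be illustrated with a small Python sketch that estimates the profiles p(c | a) from (a, c) records and then weights each profile by the input count h(a). The function names and the idea of passing the data as a list of pairs are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def conditional_profiles(pairs):
    """Estimate p(c | a) from a list of (a, c) records."""
    joint = Counter(pairs)
    input_totals = Counter(a for a, _ in pairs)
    profiles = defaultdict(dict)
    for (a, c), count in joint.items():
        profiles[a][c] = count / input_totals[a]
    return dict(profiles)


def predict_throughput(input_hist, profiles):
    """Throughput histogram: h(c) = sum over a of h(a) * p(c | a)."""
    throughput = defaultdict(float)
    for a, count in input_hist.items():
        for c, p in profiles.get(a, {}).items():
            throughput[c] += count * p
    return dict(throughput)
```

When `input_hist` is the histogram of the same records that produced the profiles, the result reproduces the observed throughput histogram, which is the identity mentioned above. Supplying the input histogram of a new cohort instead gives a prediction for that cohort.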

When probability profiles (vectors) $p(c \mid a), a \in V_A$, differ for different values of $A$,
it makes sense to think about process P as a combination of parallel processes $P_a$ for
each value $a \in V_A$. Parallel decomposition is particularly useful when:

1. differences between probability profiles are big; 

2. attribute A can be controlled by the business process manager; 

3. the histograms of A undergo big variations for inputs made at different time; 

4. we expect that process P goes on differently when Pa’s differ. 
Sequential analysis is useful when some attributes describe stages of a process.
Often, in addition to the input and output attributes, other attributes describe
intermediate results. Suppose that an attribute B allows us to decompose process P
sequentially into P1 and P2, and can be used as a measure of effectiveness of P1. This
is particularly useful when the subprocess P1 applies only to some inputs while some
other inputs can be used as a control group, so that the effectiveness of P1 can be
compared with the control group. We will discuss remedial education and financial aid
and their effectiveness as examples.

Sequential analysis is also important when it leads to knowledge useful in predicting 
process input. For instance, the number of new students depends on the number of high 
school graduates, on the cost of study per credit hour, and so on. Regularities derived 
from past data lead to predictions about student enrollment in the future. 

Since the attributes that can provide predictions of new input to the business
process are typically not available in the business process database (B), other databases
must be searched for relevant information. A relevant database D must include at least
one attribute A1 that provides information temporally prior to the input attribute A2
in B and at least one attribute J that can be used to join B and D. Further, the
joined data table B+D must yield a regularity between A1 and A2. For instance, a
regularity between high school graduation and freshmen enrollment has been detected
from tables (1), (3), and (5), listed below.



2 A walk-through example: university enrollment 

Understanding the factors in enrollment decline and increase is critical for universities,
as often the resources available to the university depend on the number of credit hours
in which the students enroll. Many specific steps to increase enrollment may not be productive
because enrollment is a complex phenomenon, especially in metropolitan institutions 
where the student population is diverse in age, ethnic origin and socio-economic status. 

Student databases kept at every university can be instrumental in understand- 
ing the enrollment. We have applied the process analysis methodology to a university 
database exploration and step after step expanded our understanding of the enrollment 
process. The initial discovery goals have been simple but their subsequent refinement 
led to sophisticated knowledge that surprised us and influenced university administra- 
tors. Within the limits of this paper we only describe a few steps and a few results. 
Our previous research on enrollment has been reported by Sanjeev & Zytkow (1996). 

Our data came from several sources. Consider a student database that consists 
of the following files (tables) 1-4 and an additional database (5): 

(1) Grade Tape for each academic term (Mainframe, Sequential File) 

(2) Student History File (Mainframe, VSAM file) 

(3) Student Transcript file (Mainframe, VSAM file) 

(4) Student Financial Assistance (Mainframe, IMS/DLl database) 

(5) High School Graduates; Kansas State Board of Education 

We used temporal precedence to group the attributes into three categories. 
Category 1 describes students prior to their university enrollment. It includes demo- 
graphics: age at first term, ethnicity, sex, and so forth, as well as high school informa- 
tion: the graduation year, high school name, high school grade point average (hsgpa), 
rank in the graduating class, the results on standardized tests (COMPACT) and so on. 
All these attributes come from the tables (1) and (5). 

The attributes in Category 2 describe events in the course of study: hours of 
remedial education in the first term (remhr), performance in basic skills classes dur- 
ing the first term, cumulative grade point average (cumgpa), number of academic 
terms skipped, maximum number of academic terms skipped in a row, number of 
times changed major, number of times placed on probation, and academic dismissal. 
All these attributes come from the tables (2), (3), and (4). 

Category 3 includes the goal attributes. They capture the global characteristics 
of a business process: the output, throughput and duration. In our example we use 
academic degrees received (DEGREE) as a direct measure of desired output. Bachelor
degrees are awarded after completing approximately 120 credit hours. But all credit 
hours taken by a student contribute to the total credit hours, which determines the 
university’s budget. Thus the total number of credit hours taken (CURRHRS) measures 
the process throughput. Process duration can be measured by the number of academic 
terms enrolled (nterm) by a student. All these attributes come from table (3). 

Query-1 prepared the initial data table. In our walk-through example we use
attributes in all three categories. We analyze a homogeneous yet large group of students,
containing first-time, full-time freshmen with no previous college experience, from
Fall 1986. The choice of the year provides sufficient time for the students to receive a
bachelor degree by the time we conducted our study, even after a number of stop-outs.

Query-1 prepares data in a number of steps: (a) select from table (1) for Fall
1986 all freshmen (class=1) without previous college experience (Prev=0) and with sex=1
or sex=2, for new males and/or new females; (b) join the result with the transcript file
(3) by SSN; (c) create the attributes that total the credit hours, the remedial classes,
the number of semesters enrolled, average the grades, etc.; (d) project the attributes in
Categories 1, 2, and 3, including all the attributes created in step (c).
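Steps (a) to (d) can be sketched as a single SQL query, run here through Python's built-in sqlite3 module. Everything about the schema is hypothetical: the table names `grade_tape` and `transcript`, their columns, the term code 'F86', and the use of quality points to compute the grade point average are illustrative assumptions, since the actual mainframe file layouts are not described.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, heavily simplified versions of files (1) and (3).
conn.executescript("""
CREATE TABLE grade_tape (ssn TEXT, term TEXT, class INTEGER,
                         prev INTEGER, sex INTEGER);
CREATE TABLE transcript (ssn TEXT, term TEXT, credit_hours REAL,
                         remedial_hours REAL, grade_points REAL);
""")
conn.executemany("INSERT INTO grade_tape VALUES (?, ?, ?, ?, ?)", [
    ("111", "F86", 1, 0, 1),  # new freshman: selected
    ("222", "F86", 1, 1, 2),  # previous college experience: excluded
    ("333", "F86", 2, 0, 1),  # not a freshman: excluded
])
conn.executemany("INSERT INTO transcript VALUES (?, ?, ?, ?, ?)", [
    ("111", "F86", 12, 3, 36),
    ("111", "S87", 15, 0, 45),
    ("222", "F86", 12, 0, 40),
])

QUERY_1 = """
SELECT g.ssn,
       g.sex,
       SUM(t.credit_hours) AS currhrs,                 -- step (c): totals
       SUM(CASE WHEN t.term = 'F86'
                THEN t.remedial_hours ELSE 0 END) AS remhr,
       COUNT(DISTINCT t.term) AS nterm,
       SUM(t.grade_points) / SUM(t.credit_hours) AS cumgpa
FROM grade_tape AS g
JOIN transcript AS t ON t.ssn = g.ssn                  -- step (b): join by SSN
WHERE g.term = 'F86' AND g.class = 1                   -- step (a): selection
  AND g.prev = 0 AND g.sex IN (1, 2)
GROUP BY g.ssn, g.sex                                  -- step (d): projection
ORDER BY g.ssn
"""
rows = conn.execute(QUERY_1).fetchall()
```

On the three made-up students, only the first one satisfies the selection, and the query returns one row with the derived attributes for that student.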

We used the 49er KDD system to search for knowledge. 49er (Zytkow & Zem- 
bowicz, 1993) discovers knowledge in the form of statements “Pattern P holds for data 
in range R”. A range of data is the whole dataset or a data subset distinguished by 
conditions imposed on one or more attributes. Examples of patterns include contin- 
gency tables, equations, and logical equivalence. Contingency tables are very useful as 
a general tool for expressing statistical knowledge which cannot be summarized into 
specialized patterns such as equations or logical expressions. Since enrollment data lead 
to fuzzy knowledge, in this paper we will only consider contingency tables, although 
some enrollment knowledge has been approximated by equations. 

49er can be used on any relational table (data matrix). It systematically searches 
patterns for different combinations of attributes and data subsets. Initially, 49er looks 
for contingency tables, but if the data follow a more specific pattern, it can turn on a 
specialized discovery mechanism, such as a search in the space of equations. 

If the statistical test of significance exceeds the acceptance thresholds, a hypothesis 
is qualified as a regularity. The significance indicates sufficient evidence. It is measured 
by the (low) probability Q that a given sample is a statistical fluctuation of random 
distribution. While in typical “manual” applications of statistics researchers accept 
regularities with Q < 0.05, 49er typically uses much lower thresholds, on the order of 
Q < 10“^, because in a single run it can examine many thousands of hypotheses, so 
many random patterns look significant at the level 0.05. 

49er’s principal, if crude, measurement of the predictive strength of contingency tables
is based on Cramér’s V coefficient

$$V = \sqrt{\chi^2 / (N \min(M_{row} - 1, M_{col} - 1))},$$

for a given $M_{row} \times M_{col}$ contingency table and a given number $N$ of records. Both Q
and V are derived from the $\chi^2$ statistic, which measures the distance between tables
of actual and expected counts. We have used V to detect tables which can be used for
parallel process decomposition. To the same end we also use correspondence analysis
to capture large differences between different probability profiles, but the details go
beyond the scope of this paper.
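For readers who want to reproduce the strength measure, Cramér's V can be computed from a table of actual counts as in the following Python sketch. It uses the standard Pearson chi-square with expected counts derived from the row and column totals. The function names are ours, and the sketch makes no claim about how 49er implements the computation.

```python
import math


def chi_square(table):
    """Pearson chi-square between actual counts and the counts expected
    under independence; also returns the number of records N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, actual in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (actual - expected) ** 2 / expected
    return chi2, n


def cramers_v(table):
    """V = sqrt(chi2 / (N * min(M_row - 1, M_col - 1)))."""
    m_row, m_col = len(table), len(table[0])
    k = min(m_row - 1, m_col - 1)
    if k == 0 or sum(map(sum, table)) == 0:
        raise ValueError("need a non-empty table with at least 2 rows and 2 columns")
    chi2, n = chi_square(table)
    return math.sqrt(chi2 / (n * k))
```

A table with perfectly dependent attributes, such as `[[10, 0], [0, 10]]`, yields V = 1, while a table whose rows have identical profiles yields V = 0.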

Quest-1, the initial discovery task, has been a broad search request. We requested
all regularities between attributes in Categories 1 and 2 as independent variables
against attributes in Category 3.

In response to quest-1, 49er’s discovery process resulted in many regularities.
In this paper, we focus on a selected few. Let us mention a few other examples
(Sanjeev & Zytkow, 1996): big differences exist in college persistence among races and
among students of different age; students who changed their majors several times
received degrees at the highest percentage.
Academic results in high school turned out to be the best predictor of persistence
and superior performance in college. Similar conclusions have been reached by Druzdzel 
and Glymour (1994) through application of TETRAD (Spirtes, Glymour & Scheines, 
1993). They used summary data for many universities, in which every university has 
been represented by one record of many attributes that represent various totals and 
averages. Since we considered records for individual students we have been able to 
derive further interesting conclusions. 

Among the measures of high school performance and academic ability, our results
indicate that the composite ACT score is a better predictor than either high school grade
point average (hsgpa) or the ranking in the graduating class. This can be seen by
comparing Tables 1-a and b. Table 1-b shows a regularity which is slightly stronger
(V: 0.20 vs 0.19).

Analogous patterns of approximately the same strength and significance relate
COMPACT and HSGPA with all three goal variables. The corresponding tables cannot
be reproduced due to the space limit.

Parallel decomposition of the process by the ACT scores has been very useful, and
led to further findings when different subprocesses were analyzed in detail. We
will see that in the case of remedial instruction. Since the values of ACT and HSGPA are
closely related, it does not make sense to create separate parallel processes for HSGPA.

A sequential analysis problem: does financial aid help retention? Financial
aid attributes belong to Category 2, the events that occur during the study process.
Aid is available in the form of grants, loans and scholarships. The task of Query-2 has
been to utilize the financial aid data. By joining the source table (4) with the table
obtained as a result of query-1 we augmented that table with 64 attributes: eight types
of financial aid awarded to students in each of the 8 fiscal years 1987-94.

Quest-2 confronted the new aid attributes with our goal attributes
(Category 3). The results were surprising. No evidence has been found that financial
aid causes students to enroll in more terms, take more credit hours and receive degrees.
For instance, the patterns for financial aid received in the first fiscal year indicated a
random relation with very high probability: Q = 0.88 (for terms enrolled), Q = 0.24
(for credit hours taken) and Q = 0.36 (for degrees received). None would pass even the
least demanding threshold of significance.

These negative results stimulated us to use query-3 and query-4 to select the
subgroups of students at two extremes of the spectrum: those needing remedial instruction
and those who had received high school grade 'A'/'B'. In each of the two subgroups we tried
quest-3 and quest-4, analogous to quest-2.
Table 2. Actual Tables for DEGREE vs REMHR (a) all students (b) remedial

In quest-3, for students needing remedial instruction, we sought a possible
impact on the Category 3 variables of financial aid received in the first fiscal year, when
the remedial instruction was provided. The results were negative: the patterns
between the amount of financial aid received and the goal variables had the following
high probabilities of randomness: Q = 0.11 (for terms enrolled), Q = 0.22 (for credit
hours taken) and Q = 0.86 (for degrees received). In response to quest-4, in the group
of students receiving high school grade 'A'/'B', the corresponding probabilities were
Q = 0.99 (for terms enrolled), Q = 0.99 (for credit hours taken) and Q = 0.94 (for
degrees received). These findings indicate that financial aid received by students in the
first year was not helpful in their retention.

Using query-5 we created an attribute which provides the total dollar amount of
aid received in fiscal years 1987-1994. Now, with quest-5 we sought the impact of this
variable on the goal attributes in Category 3. Finally, a positive influence of financial
aid has been detected (Sanjeev & Zytkow, 1996), but the results are due to the fact that
in order to receive financial aid the student must be enrolled. We could not demonstrate
that financial aid plays the role of seed money by increasing the enrollment in the years
when it hasn’t been received.

An example of sequential analysis: remedial instruction. One of the independent
variables in Category 2 used in quest-1 was REMHR (total number of remedial
hours taken in the first term). The search returned an intriguing regularity
(Table 2-a): "Students who took remedial hours in their first term are less likely to
receive a degree". The percentage of students receiving a degree decreased from 31%
for REMHR=0 to 13% for REMHR=8. This is a disturbing result, since the purpose of
remedial classes is to prepare students for the regular classes.

Query-6: select students needing remedial instruction. After a brief analysis
we realized that Table 2-a is misleading. Remedial instruction is intended only for
academically under-prepared students, so students who do not need remedial
instruction are not the right control group. In order to obtain relevant
data we had to identify the students for whom remedial education had been intended
and analyze success only for those students. After discussions with several
administrators, the need for remedial instruction was defined as query-6: select students
with a composite ACT score of less than 20 and either a high school GPA of 'C'/'D'/'E'
or graduation in the bottom 30% of the class. Those students for whom remedial
instruction was intended but who did not take it played the role of the control group.
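
The selection in query-6 can be sketched as a simple predicate. This is only an illustration under stated assumptions: the record layout and the field names (`act_composite`, `hs_grade`, `hs_class_percentile`, `remedial_hours`) are hypothetical, the grade list 'C'/'D'/'E' is our reading of the criterion, and "bottom 30% of the class" is modeled as a percentile rank below 30. It is not the query language used with 49er.

```python
from dataclasses import dataclass


@dataclass
class Student:
    # All field names are hypothetical; the source database schema is not given.
    act_composite: int          # composite ACT score
    hs_grade: str               # average high school grade, "A" .. "E"
    hs_class_percentile: float  # rank within graduating class, 0 = lowest
    remedial_hours: int         # REMHR: remedial hours taken in the first term


def needs_remedial(s: Student) -> bool:
    """Query-6 criterion: low ACT and (weak high school grade or bottom 30%)."""
    low_act = s.act_composite < 20
    weak_high_school = s.hs_grade in {"C", "D", "E"} or s.hs_class_percentile < 30.0
    return low_act and weak_high_school


def split_groups(students):
    """Return (treated, control) among the students who needed remedial classes."""
    needed = [s for s in students if needs_remedial(s)]
    treated = [s for s in needed if s.remedial_hours > 0]
    control = [s for s in needed if s.remedial_hours == 0]
    return treated, control
```

The control group is drawn from the same selected population, which is the point of query-6: students who never needed remedial instruction are excluded before any comparison is made.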

Quest-6: for data selected by query-6 search for regularities between REMHR
and process performance attributes. Use attributes in Category 1 to make
subsets of data. 49er's results were again surprising: no evidence was
found that remedial instruction helps academically under-prepared students to
enroll in more terms, take more credit hours or receive degrees.

Table 2-b indicates that taking remedial classes does not improve the chances for
a student to reach a degree. For instance, those students who did not take remedial
classes, but needed them according to our criteria, received bachelor and associate
degrees at about the same percentage (10.8% vs 9.9%) as those who
took from 3 to 8 hours of remedial class. A similar table indicates no relationship
(Q = 0.98) between hours of remedial classes taken and number of terms enrolled.

Finally, let us briefly discuss how external data can be used to predict part of
the input. Table (5) provides the number of high school graduates by year and county.
Knowing the fraction of that number who enroll at WSU, we can predict in June a
part of university enrollment in August. Query-7 joined tables (1) and (5) by the year,
but to make the join operation possible it first aggregated table (1) for the corresponding
years by county, counting the numbers of students per county. Quest-7 then
determined the percentage of students who transfer from high school to WSU. That
number, recently at about 20%, can be used to predict new enrollment.
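
The aggregation and prediction described above can be sketched as follows. The data layout (one `(year, county)` pair per newly enrolled student, and a dictionary of graduate counts keyed by `(year, county)`) and all function names are assumptions made for illustration; they do not reproduce the actual tables (1) and (5).

```python
from collections import Counter


def count_enrolled(records):
    """Aggregate per-student records into counts per (year, county)."""
    return Counter((year, county) for year, county in records)


def transfer_rate(enrolled, graduates):
    """Fraction of high school graduates who enrolled, over the joined keys."""
    total_graduates = sum(graduates.values())
    total_enrolled = sum(enrolled.get(key, 0) for key in graduates)
    return total_enrolled / total_graduates


def predict_new_enrollment(graduates_by_county, rate):
    """Predicted new enrollment per county for a year with known graduate counts."""
    return {county: rate * n for county, n in graduates_by_county.items()}
```

With a rate estimated from past years, the June graduate counts give a rough estimate of the August intake from each county.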



3 Impact of findings on the business process 

In 1995-97 the results of our enrollment research were presented to senior university
administrators, including the Vice President of Academic Affairs, Associate Vice
Presidents, the Director of Budget and academic college Deans. Many of them chaired or
were members of executive committees such as the Strategic Plan Task Force, the Academic
Affairs Management Group, and the University Retention Management Committee.

Starting in the Spring 1997, for the fourth consecutive enrollment period, WSU’s en- 
rollment has increased. It is the most consistent enrollment increase in the 1990s. While 
business decisions are not always based entirely on empirical evidence, such evidence 
helps to make well-informed decisions. We discuss here some of the strategic decisions, 
and outline how our findings have formed their underlying empirical foundation. 

“WSU will recruit and retain high quality students from a variety of ethnic 
and socioeconomic backgrounds” is the second of the five Goals and Objectives 
stated in the draft Strategic Plan for Wichita State University. This strategic plan, 
outlined in 1997, is currently being presented to the various university constituencies, 
such as the faculty senate, for review and acceptance. 

In 1995 and 1996, our research showed that academic results in high school
are the best predictor of persistence and superior performance in college. Table 1
shows regularities between composite ACT score and average grade in high school (hsgpa)
as predictors and the college performance attributes: cumulative credit hours taken
(CURRHRS), total academic terms enrolled (nterm), and degrees received (DEGREE).

The eight-year graduation rate measure has been included as WSU's performance
indicator. A strategic planning process called VISION 2020, initiated by the
State of Kansas, requires the universities to formulate a set of performance indicators
and report the results. The first of the core indicators concerns undergraduate student
retention and graduation rates. The report mandated by the Regents asked for graduation
rate measures after four, five, and six years. But the students at WSU often take
longer than six years to graduate: they tend to stop out for several academic terms
during their college careers and enroll in fewer than 15 hours in one semester. This
phenomenon was a conclusion from our studies in 1995 and 1996. It can be partially
observed in Table 1-c, which shows that a significant percentage of students enroll for
more than 11 terms. In addition, many students stop out for a few semesters. As a result,
among the six Regents universities in Kansas, WSU is the only institution in which the
graduation rate is also measured at the end of the eighth year.

Our negative results increased the awareness of the cost of remedial education.
Although the upcoming replacement of open admission by entry requirements is
pushing aside the question of reforming remedial education, university administrators
are increasingly questioning the effectiveness of remedial classes. In the Fall of 1997
a cost study was conducted on remedial education programs. The question "Can those costs
be justified in the absence of empirically provable success?" is currently being discussed.



4 Process networks vs. Bayesian networks 

While Bayesian networks (Heckerman, 1996; Spirtes, Glymour & Scheines, 1993) emphasize
the relationships between attributes, the business process networks capture
the relations between states of affairs at different times. They resemble the physical
approach to causality: the state of a system at time $T_1$, along with the domain
regularities, causes the state at time $T_2$. This is a more basic understanding of causes.

In distinction to Bayesian networks that relate entire data through probabilistic 
relations between attributes, a subprocess often involves only a slice of data. For in- 
stance, remedial classes are taken by only some students. One part of a process can 
be decomposed differently than another part and the corresponding subsets of records 
can hold different regularities. For instance, a causal relation between attributes may 
differ or not exist in a subset of data characterized by a subset of attribute values. 

Conclusions. We have introduced a number of knowledge discovery techniques useful 
in analyzing business processes. A process can be divided into parallel components when 
it improves the predictions. Processes can be analyzed sequentially, to find out how the 
preceding process influences the next in sequence. We also demonstrated how queries 
that seek data can be combined with quests that seek knowledge. In a KDD process, 
queries are instrumental to quests. The data that we used as a walk-through example 
have been obtained from a large student database, augmented with statistics kept by 
the State of Kansas. Both have been explored by 49er, leading to many discoveries. In 
this paper we used a few examples of practically important findings that have influenced 
the University policies.

Abstract. A genetic routing method for the switchbox problem with a novel
coding technique is presented. The principle of the proposed method and the
results of computer experiments are described in detail.



1 Introduction 

A genetic algorithm (GA) is a search method, based on the mechanics of natural
selection and genetics [1], that seeks an optimum solution while avoiding entrapment
in local minima. In VLSI physical design, the wiring region is composed of channels
and switchboxes. Channels limit their interconnection terminals to one pair of
parallel sides. Switchboxes are generalizations of channels and allow terminals
on all four sides of the region [2]. Several channel routers using GAs have been
reported [3][4], but no GA-based switchbox router has been proposed. The switchbox
routing problem is more difficult than the channel routing problem. Several approaches
to it have been reported. "WEAVER", which is knowledge-based [5], and "BEAVER",
which is based on computational geometry [6], are well-known switchbox routers.
The quality of their solutions is better than or comparable to that of previously
reported solutions. However, these systems are complicated and require long
calculation times.

In this paper, a genetic routing method for the switchbox problem is presented.
The method is simple and effective for practical switchbox problems.

2 Routing method 

The order of routing is very important in switchbox routing problems.

Figure 1(a) shows an example in which the net a-a' was connected first, while
figure 1(b) shows the same example with net b-b' connected first. In figure 1(b)
it is then impossible to connect the net a-a'.

The presented method is not influenced by the order of routing, because all nets
are routed at the same time. That is, the shapes of all wires are assumed first,
and then all wires are improved by the genetic algorithm. However, if the coding
technique and the crossover operation are not appropriate, the result will be no
better than that of a random search.

Fig. 1. Influence of routing order: (a) a-a' is connected first; (b) b-b' is connected first.



2.1 Model 

In this paper, the switchbox model has two layers. The lower layer has only
x-direction wires, while the upper layer has only y-direction wires.

Fig. 2. Types of wire shape: (a) S-type; (b) D-type



In this paper, two types of wire shape are defined. One is the "S-type",
where the source terminal and the destination terminal are on the same layer. The
other is the "D-type", where the source terminal and the destination terminal
are on different layers. Figure 2 (a) and (b) show the corresponding examples. The
S-type wire is defined by four vias¹ $P_1 \sim P_4$, while the D-type wire is defined by
three vias $P_1 \sim P_3$.

Figure 3 shows some special cases. These can be regarded as cases in which
vias $P_2$ and $P_3$ overlap. However, in some cases additional
operations are required. Figure 4 shows an example which has a wasteful wire.
The operations $P_2 \to P_1$ and $P_3 \to P_4$ have to be applied in this case.



¹ A "via" is a hole that connects wires in different layers.

Fig. 3. Examples with overlapped vias

Fig. 4. Example with wasteful wire



2.2 Coding 

In the coding technique, it is important to generate no lethal genes and to preserve
schemata. As preparation for coding, the coordinates of the vias are summarized in
figure 5. The following parameters are required to define a wire shape.

(Case 1) S-type; $P_s$ is on the x-axis: $\{x_a, y_a, y_b\}$

(Case 2) S-type; $P_s$ is on the y-axis: $\{x_a, y_a, x_b\}$

(Case 3) D-type; $P_s$ is on the x-axis: $\{x_a, y_a\}$

(Case 4) D-type; $P_s$ is on the y-axis: $\{x_a, y_a\}$

Therefore, the following code can be obtained as an example,

$$\{x_a, y_a, z\} \qquad (1)$$

where z is $y_b$ in case 1, $x_b$ in case 2, and ignored in cases 3 and 4. However,
it is more effective to change the x-coordinate and the y-coordinate at the same time
in the search for optimum vias. Therefore, we use grid numbers instead of x, y
coordinates. Figure 6 shows the grid numbers. In this paper, eq. (2) is used as the code
for one wire,

$$\{G_a, G_b\} \qquad (2)$$

where $G_a$ and $G_b$ are grid numbers. In case 1, only the y-coordinate of grid
$G_b$ is used, while only its x-coordinate is used in case 2. In cases 3 and 4, grid
$G_b$ is ignored. Figure 7 shows the format of the chromosome for all wires.
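
To make the role of the two grid numbers concrete, the decoding of one wire's code can be sketched as below. The grid numbering of figure 6 is not reproduced in the text, so a 0-based row-major numbering (`g = y * nx + x`) is assumed here; the function names and the string labels for wire type and source axis are hypothetical.

```python
def grid_to_xy(g, nx):
    """Assumed numbering: grid g lies at column g % nx, row g // nx (0-based)."""
    return g % nx, g // nx


def decode_wire(ga, gb, wire_type, source_axis, nx):
    """Map the pair of grid numbers of one wire to its shape parameters.

    wire_type is "S" or "D"; source_axis is "x" or "y" (axis holding the source).
    """
    xa, ya = grid_to_xy(ga, nx)
    if wire_type == "D":
        # Cases 3 and 4: the second grid number is ignored.
        return (xa, ya)
    xb, yb = grid_to_xy(gb, nx)
    if source_axis == "x":
        # Case 1: only the y-coordinate of the second grid is used.
        return (xa, ya, yb)
    # Case 2: only the x-coordinate of the second grid is used.
    return (xa, ya, xb)
```

Because every grid number decodes to a valid parameter set, any gene value inside the grid yields a wire shape, which is one way to read the requirement that no lethal genes be generated.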

2.3 Crossover 

The uniform crossover technique is used in the presented method. 
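
Uniform crossover exchanges each gene between the two parents independently with a fixed probability. A minimal sketch follows, assuming the chromosome is a flat list of grid numbers (two per wire) and a swap probability of 0.5; neither detail is stated in the text.

```python
import random


def uniform_crossover(parent_a, parent_b, rng, swap_prob=0.5):
    """Return two children; each gene position is swapped with probability swap_prob."""
    child_a, child_b = list(parent_a), list(parent_b)
    for i in range(len(child_a)):
        if rng.random() < swap_prob:
            child_a[i], child_b[i] = child_b[i], child_a[i]
    return child_a, child_b
```

Passing an explicit `random.Random` instance keeps runs reproducible, for example `uniform_crossover(a, b, random.Random(0))`.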

2.4 Mutation 

In the mutation operation of the presented method, the wire shape sometimes
does not change, because the grid $G_b$ is only partially used or not used at all.

Fig. 5. Coordinates of vias in wires



2.5 Evaluation function 



The evaluation function is defined as follows. 



$$f = \exp\left(-\alpha\,\frac{L_d}{L_m} - \beta\,\frac{V_d}{V_s} - \gamma\,\frac{L_t - L_m}{L_m}\right) \qquad (3)$$



where $L_d$ is the overlapped wire length, $L_t$ is the obtained total wire length, $L_m$
is the minimum total wire length², $V_d$ is the number of overlapped vias, $V_s$ is the
standard number of vias, and $\alpha$, $\beta$, $\gamma$ are weight constants. In the
presented method, $V_s$ is calculated as follows,

$$V_s = 4N_s + 3N_d \qquad (4)$$

where $N_s$ and $N_d$ are the numbers of S-type and D-type wires, respectively.
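
The quantities entering the evaluation function can be computed as in the sketch below. Two points are assumptions rather than facts from the text: the standard via number is taken as four vias per S-type wire plus three per D-type wire, and the overlap term is normalized by the minimum total wire length in the same way as the length term. The function names are hypothetical.

```python
import math


def manhattan_min_length(nets):
    """Minimum total wire length: sum of Manhattan source-destination distances."""
    return sum(abs(xs - xd) + abs(ys - yd) for (xs, ys), (xd, yd) in nets)


def standard_via_count(n_s, n_d):
    """Assumed form: four vias per S-type wire and three per D-type wire."""
    return 4 * n_s + 3 * n_d


def fitness(l_d, v_d, l_t, l_m, v_s, alpha=10.0, beta=10.0, gamma=0.0):
    """Exponential fitness; equals 1.0 for a layout without overlaps at minimum length."""
    penalty = (alpha * l_d / l_m
               + beta * v_d / v_s
               + gamma * (l_t - l_m) / l_m)
    return math.exp(-penalty)
```

With `gamma=0.0` the total wire length does not influence the fitness, so only overlapped wires and overlapped vias are penalized.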



2.6 Generation of initial population 

To improve the convergence of the GA, the following initial population is used.

² $L_m$ is calculated as the sum of the Manhattan lengths between source and
destination terminals.

Fig. 6. Grid numbers

Fig. 7. Format of chromosome



(Case 1) S-type; $P_s$ is on the x-axis: $G_a = (x_s, m)$, $G_b = (x_d, m)$

(Case 2) S-type; $P_s$ is on the y-axis: $G_a = (n, y_s)$, $G_b = (n, y_d)$

(Case 3) D-type; $P_s$ is on the x-axis: $G_a = (x_s, y_d)$

(Case 4) D-type; $P_s$ is on the y-axis: $G_a = (x_d, y_s)$

where m and n are random numbers ($1 \le m \le N_y$, $1 \le n \le N_x$). These
values represent the simple wire shapes illustrated in figure 3. However,
50% of the initial population is generated from random values to keep the initial
solution space large.



3 Computer experiments 

In order to examine the proposed method, we wrote a routing program in
Visual BASIC and ran computer experiments on a PC with a Pentium MPU
(133 MHz).

Table 1 summarizes the results for 10 examples of 10 x 10 size with 10
nets. The routing program was executed 10 times for each netlist. In Table 1, $P_s$
is the probability of 100% connection. $G_{min}$ and $G_{max}$ are the minimum
and maximum generation numbers needed to obtain 100% connection, respectively. The
symbol means that 100% connection failed.

In these experiments, the total wire length is not considered ($\gamma = 0$). The
GA parameters are as follows: the population size is 100, the mutation ratio is
2%, the number of elites is 20, the maximum generation number is 200, and
$\alpha = \beta = 10$. The execution time for one generation is 50 msec. The results
show that 100% connection is obtained in almost all cases.

Table 1. Probabilities and generation numbers for 100% connection

Fig. 8. An example of routing (10 x 10 size, 10 nets)



Figure 8 shows an example of routing result (netlist A^i). 

Figure 9 and figure 10 show the results for an example of 20 x 20 size with
20 nets. In figure 9, the total wire length is not considered ($\gamma = 0$), while in
figure 10, it is considered ($\gamma = 1$). The minimum generation numbers for 100%
connection are 42 and 100, respectively. The calculation time for one generation
is 130 msec.

Finally, a result for Burstein's well-known switchbox problem [7] is shown in figure
11 (23 x 15 size, 24 nets). In this case, two nets could not be connected, while
WEAVER and BEAVER achieved 100% connection. The likely reason is the low
freedom of wire shapes: the number of vias and the wire direction in each layer
are restricted in the presented method. The population size was 500 in this case.
The improvement of solutions saturated at the 13th generation. Nets with
more than two terminals were connected by dividing them into two-terminal
nets.

Fig. 9. An example of routing (20 x 20 size, 20 nets, $\gamma = 0$)

Fig. 10. An example of routing (20 x 20 size, 20 nets, $\gamma = 1$)

Fig. 11. An example of Burstein's switchbox problem



4 Conclusions 

A novel routing method for the switchbox problem using a genetic algorithm was
presented. The computer experiments showed that the proposed method is effective
for practical routing problems.

Future work is to improve the connection ability and to reduce the
calculation time.

Abstract. For the calibration of laser induced plasma spectrometers
robust and efficient local search methods are required. Therefore, several
local optimizers from nonlinear optimization, random search and evolu- 
tionary computation are compared. It is shown that evolutionary algo- 
rithms are superior with respect to reliability and efficiency. To enhance 
the local search of an evolutionary algorithm a new method of random 
memorizing is introduced. It leads to a substantial gain in efficiency for 
a reliable local search. 



1 Introduction and Motivation 

Laser induced plasma spectrometry is one of the latest developments in measure- 
ment technologies. It is very precise, can be applied to multi-element analysis 
without any preparation, and can be used especially in process measurement 
such as glass and steel processing. 

Since laser spectrometry is very precise the calibration of such devices has 
to be also very precise. This is a difficult task. The calibration accuracy
depends on the time-dependent modeling of the atomic light emission of a
micro-plasma. The calibration model is usually nonlinear because of different physical
effects like re-absorption and light scattering. It equally well depends on the 
precise adaptation of the calibration model parameters to the calibration sam- 
ples. This problem is ill-posed. Regularization has to be done due to facts from 
laser physics. Furthermore, from experience, it is ill-conditioned with condition 
indices Xmax / ^min > 10^. It is also high dimensional. For the calibration of 60 
elements 120 to 180 parameters have to be adapted. 

For the calibration we designed a hierarchical adaptation procedure. At 
higher levels the search space will be constrained by evidences from plasma 
physics. Coarse search procedures are applied. At the lowest level a fine tuning 
of the model parameters is done. It is outside the scope of this paper to describe 
the whole procedure. Here we will explain to some extent the lowest level of cali- 
bration parameter fine tuning. For this task the goal was to find an efficient and 
robust local search algorithm which does not use analytic gradient information. 

We made an extensive exploration of the available algorithms. A comparison
is given in section II.

For the fine tuning we apply the simplest evolutionary algorithm, the (1 + 1)-
Evolution Strategy - but with a memory. This memory is accessed at random to
find search directions based on former good solutions. These memorized directions
are exploited by a beam search. In section III we present a conceptual algorithm
for the enhancement of evolutionary algorithms with global step size adaptation
by random memorizing. For an implementation with a (1 + 1)-Evolution
Strategy, results for difficult unimodal test functions are presented and compared.

Finally, in section IV the application to the calibration of laser spectrometers 
is described in more detail. 

2 Searching for a Robust and Efficient Local Optimizer 

The assessment of the problem solving capacity of different local optimizers^ 
was done using the well known Rosenbrock function /r (see next section) for 
n = 2, ..., 100 variables. The initial point was set to X{ = 0, i = 1, ..., n. The num- 
ber of function evaluations was counted until the function value fstop < 10“^ 
was reached or a maximum number of generations had passed. The results are 
summarized in the following table where MEAN is the average number of func- 
tion evaluations to reach fstop ^ 10“^, SD the corresponding standard devia- 
tion, SUCC the number of successful runs related to the total number of runs 
(succ/total), and BF gives the best function value reached for non-converging 
runs. 

A natural first step in looking for good local optimizers is to consult avail- 
able program packages including procedures for mathematical programming or 
nonlinear optimization (e.g. [14]). Among the recommended procedures are the 
simplex method [10] and different variants of second order optimization methods 
based on conjugate directions [12] or quasi Newton procedures [11]. 

The results of the application of these methods to the minimization of function
$f_R$ are given in Table 1. As can be seen at once, none of these methods
is reliable enough to locate the optimum with the desired accuracy for dimensions
up to 100 variables. Similar observations have already been made for other
functions in [3]. In a next step a random search technique [13] and the dynamic
hill-climber [7] were analyzed (see Table 1). The effort to reach the optimum
with the random search technique for a very moderate dimension of n = 10 was
much higher than with the conjugate direction method.

And even the dynamic hill-climber, though very reliable for dimensions up 
to n = 30, could not locate the optimum for n = 100. 

Therefore we looked for alternatives which may be offered by evolutionary 
algorithms. 

As has been shown [8][9], the Breeder Genetic Algorithm with fuzzy gene
pool recombination, the utilization of covariances and generalized elitist selection
is a very reliable search method. The same is true for the (1, λ)-Evolution

¹ The results of Table 1, except for the method of Hansen et al., were kindly provided
by Dirk Schlierkamp-Voosen. Many helpful discussions with Heinz Mühlenbein and
Dirk Schlierkamp-Voosen from the GMD are gratefully acknowledged.

Table 1. Experimental results for the different analyzed methods

       Nelder and Mead                            Powell
n      MEAN       SD         SUCC   BF            MEAN       SD         SUCC   BF
2      1.690E+02  0.000E+00  0/1    4.384E-01     4.350E+02  0.000E+00  1/1
3      2.030E+02  0.000E+00  0/1    1.603E+00     6.380E+02  0.000E+00  1/1
10     3.382E+03  0.000E+00  1/1                  5.975E+03  0.000E+00  1/1
20     1.009E+05  0.000E+00  1/1                  1.907E+04  0.000E+00  1/1
30     1.705E+05  0.000E+00  0/1    1.535E+01     4.278E+04  0.000E+00  1/1
50                                                6.248E+07  0.000E+00  1/1
60                                                2.098E+05  0.000E+00  0/1    4.766E-09
70                                                3.748E+05  0.000E+00  0/1    3.892E-09
80                                                3.569E+05  0.000E+00  0/1    4.551E-09
100    2.000E+05  0.000E+00  0/1    8.635E+01     5.507E+05  0.000E+00  0/1

       Stewart                                    Solis and Wets
n      MEAN       SD         SUCC   BF            MEAN       SD         SUCC   BF
2      7.350E+02  0.000E+00  0/1    2.553E-02     3.780E+03  0.000E+00  1/1
3      2.985E+03  0.000E+00  0/1    7.549E-01     3.780E+03  0.000E+00  1/1
10     4.820E+02  0.000E+00  0/1    8.713E+00     2.770E+05  0.000E+00  1/1
20     1.580E+02  0.000E+00  0/1    1.880E+01
30     2.170E+02  0.000E+00         2.870E+01
100    6.370E+02  0.000E+00  0/1    9.799E+01

       Yuret and de la Maza                       Hansen, Ostermeier, Gawelczyk
n      MEAN       SD         SUCC   BF            MEAN       SD         SUCC   BF
2      3.910E+02  0.000E+00  30/30                5.043E+02  1.237E+02  5/5
3      1.998E+03  0.000E+00  30/30                1.025E+03  1.966E+02  5/5
10     5.866E+04  0.000E+00  30/30                1.027E+04  7.875E+02  5/5
20     1.418E+05  0.000E+00  30/30                4.398E+04  3.627E+03  5/5
30     2.345E+05  0.000E+00  30/30                1.129E+05  6.576E+02  5/5
100    5.827E+03  0.000E+00  0/30   9.698E+00


Strategy with covariance matrix adaptation [5]. Unfortunately, both methods
require the solution of complete eigenvalue problems. This operation is of order
$O(n^3)$, thus increasing the computational overhead considerably for high-dimensional
problems. The best results were obtained with the (1, λ)-Evolution Strategy with
generating set adaptation [6] (Table 1). This procedure was very robust, with in
general less computational effort than the other methods. But the effort
was still too high for problems having more than 100 variables.

3 Adding Random Memorizing to Evolutionary 
Algorithms 

What are the lessons to be learned from the previous experiments? First of all,
good local search techniques do have a memory - either a sequential one like
the conjugate direction method [12], the quasi Newton method [11], the Evolution
Strategies with generating set adaptation [6] or with covariance matrix
adaptation [5] - or a parallel one like the Breeder Genetic Algorithm with fuzzy
gene pool recombination and covariance utilization [9]. A closer look at
all methods which use a sequential memory reveals that it is generally used in
a rather deterministic or derandomized way for the construction of new search
directions. On the other hand there are already preliminary investigations into
direction mutations [2] without using a memory, but the results are unsatisfactory.
Therefore it was decided to analyze the impact of randomly accessing a
sequential memory for the determination of new search directions.



3.1 Conceptual Local Evolutionary Algorithm with Random
Memorizing

The basic idea is the following: Use any Evolutionary Algorithm (EA) which
has a global step size control. Add a sequential memory to the Evolutionary
Algorithm where previous best solutions in search space are stored up to a certain
depth. Determine by means of the EA a better solution. Take a solution from
memory at random and compute a search direction from the solution of the EA and
the randomly memorized solution. Follow this search direction by increasing
multiples of the global step size as long as better solutions are found. Then
go ahead with the EA, etc.

This procedure is formalized on the following page. Obviously, there are two
parameters concerning the memory access: the memory depth $d_m$ and the
beam search factor b. Furthermore the type of the probability density distribution
for the random access of the sequential memory is important. For simplicity
we used a uniform distribution. Other distributions may be more appropriate.



3.2 The Enhanced (1 + 1)-Evolution Strategy

To verify the outlined algorithm we instantiated it with the simplest EA,
the (1 + 1)-Evolution Strategy. A thorough theoretical analysis of the (1 + 1)-
Evolution Strategy is given in [1][4]. The parameters of the EA were set as
recommended there. The memory depth was set rather high to dm = 2 • n...n^ 
and the beam search factor to b = 2.

A first experiment was made for Rosenbrock's function with n = 20 and the
initial point $x_i = 0$, i = 1, ..., n, resulting in a real surprise. This experiment is
shown in Figure 1. The (1 + 1)-Evolution Strategy with random memorizing is
about four times faster than the (1, λ)-Evolution Strategy with generating set
adaptation. To add a further surprise, it is about twice as fast as the (1, λ)-
Evolution Strategy with covariance matrix adaptation [5]. There is almost no
computational overhead for the extraction of information from the memory. We
then computed the minimum of /r for dimensions n = 5, ..., 200 and c = 10“^^ 
as shown in Table 2. For this function this method was superior to all other
considered approaches with respect to efficiency and robustness.

Finally, we benchmarked the method with the following functions:

- $f_R = \sum_{i=1}^{n-1} \left(100\,(x_i^2 - x_{i+1})^2 + (x_i - 1)^2\right)$  Rosenbrock function

- $f_S = \sum_{i=1}^{n} x_i^2$  Sphere function

- $f_E$  Hyper ellipsoid with condition index 10^ [6]

- $f_P = -x_1 + \sum_{i=2}^{n} x_i^2$  Parabolic ridge [1]
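
For reference, three of the benchmark functions can be written down directly. This is a sketch using the standard textbook forms of the generalized Rosenbrock function, the sphere and the parabolic ridge; the hyper-ellipsoid is omitted because its exact scaling is defined in [6] and not in the text.

```python
def rosenbrock(x):
    """Generalized Rosenbrock function; minimum 0 at x = (1, ..., 1)."""
    return sum(100.0 * (x[i] ** 2 - x[i + 1]) ** 2 + (x[i] - 1.0) ** 2
               for i in range(len(x) - 1))


def sphere(x):
    """Sphere function; minimum 0 at the origin."""
    return sum(v * v for v in x)


def parabolic_ridge(x):
    """Parabolic ridge; unbounded below along the first coordinate axis."""
    return -x[0] + sum(v * v for v in x[1:])
```

At the initial point used in the experiments, $x_i = 0$, the Rosenbrock function evaluates to n - 1, since every term contributes exactly 1.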

Outline of the Local Evolutionary Search with Random Memorizing

set generation counter g := 0;
set memory counter c := 0;
set beam search factor b;
set maximal number of generations g_max;
set memory depth d_m;
set initial values of variables x_g;
set initial global step size d_g;
set termination criterion ε;
set initial memory content m_i, i = 0, ..., d_m - 1;
compute fitness f := f(x_g);
do
    do
        g := g + 1;
        apply the EA for this generation to obtain f_g, x_g, d_g;
    until f_g < f_{g-1};
    update memory m_(c mod d_m) := x_g;
    select a memory content m_r with r ∈ {0, ..., d_m - 1}, r ≠ c mod d_m,
        randomly drawn with uniform probability 1/(d_m - 1);
    compute the direction s := (x_g - m_r) / ||x_g - m_r||;
    set beam search counter c_b := 0;
    set actual beam search factor b_a := 1;
    set x_{c_b} := x_g;
    do
        c_b := c_b + 1;
        b_a := b_a · b;
        compute new beam point x_{c_b} := x_{c_b - 1} + b_a · s · d_g;
    while f(x_{c_b}) < f(x_{c_b - 1});
    update memory m_(c mod d_m) := x_{c_b};
    c := c + 1;
    x_g := x_{c_b};
until (f(x_{c_b}) < ε) or (g > g_max);
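
The outline can be turned into a small runnable sketch. Several details below are assumptions and not taken from the paper: the inner EA is a (1 + 1)-Evolution Strategy with a multiplicative 1/5-success step size rule (factors 1.5 and 1.5^(-1/4)), the memory is initialized with copies of the start point, the default memory depth is 2n, and after the beam search the last improving beam point is kept as the current solution. The name `lesrm` and all parameter names are hypothetical.

```python
import math
import random


def sphere(x):
    """Sphere function, used here only to exercise the search."""
    return sum(v * v for v in x)


def lesrm(f, x0, step=1.0, depth=None, beam=2.0, eps=1e-10,
          max_gen=200_000, seed=0):
    """(1 + 1)-ES with a randomly accessed sequential memory and beam search."""
    rng = random.Random(seed)
    n = len(x0)
    depth = max(2, depth if depth is not None else 2 * n)
    x, fx = list(x0), f(x0)
    memory = [list(x0) for _ in range(depth)]  # assumed initial memory content
    c = 0        # memory counter
    g = 0        # generation counter
    evals = 1    # number of function evaluations
    while fx >= eps and g < max_gen:
        # Run the (1 + 1)-ES until one generation yields an improvement.
        improved = False
        while not improved and g < max_gen:
            g += 1
            y = [xi + step * rng.gauss(0.0, 1.0) for xi in x]
            fy = f(y)
            evals += 1
            if fy < fx:
                x, fx, improved = y, fy, True
                step *= 1.5            # success: enlarge the global step size
            else:
                step *= 1.5 ** -0.25   # failure: shrink it (1/5 success rule)
        if not improved:
            break
        slot = c % depth
        memory[slot] = list(x)
        # Draw a memorized solution uniformly from all other slots.
        r = rng.choice([i for i in range(depth) if i != slot])
        diff = [xi - mi for xi, mi in zip(x, memory[r])]
        norm = math.sqrt(sum(d * d for d in diff))
        if norm > 0.0:
            s = [d / norm for d in diff]
            b_a = 1.0
            while True:
                # Beam search: follow s with growing multiples of the step size.
                b_a *= beam
                y = [xi + b_a * step * si for xi, si in zip(x, s)]
                fy = f(y)
                evals += 1
                if not fy < fx:
                    break
                x, fx = y, fy
            memory[slot] = list(x)
        c += 1
    return x, fx, evals
```

A call such as `lesrm(sphere, [1.0] * 5)` returns the final point, its function value and the number of function evaluations used. The beam search only ever accepts improving points, so it cannot worsen the current solution; its cost is at least one extra evaluation per successful generation.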



The functions fp and fp were transformed for each run to a randomly gen- 
erated basis system [6]. The initial points were set in the corresponding basis 
system to Xi = 0 for fp and fp and Xi = 1 for fg and fp, i — 1, n. The runs 
were terminated at fstop < ^ = 10“^^ for fp, fg and fp. For fp it was set to 
fstop < f = —10^. For dimensions n = 5, 10, 20,40, 100 we made 30 runs each. 
The average number of function evaluations to reach the minimum with the given
precision, together with the standard deviation, is depicted in Figure 1. It is
notable that all runs for all functions converged with the given precision. Comparing
the results with data from [6] and [5], the local evolutionary search algorithm
with random memorizing needs about half the number of function evaluations
of the (1, λ)-Evolution Strategy with covariance matrix adaptation and only
a quarter of the number of function evaluations of the (1, λ)-Evolution Strategy
with generating set adaptation.

[Plot: Rosenbrock function evaluations vs. dimension]

Fig. 1. Results for $f_R$ for the (1, λ)-Evolution Strategy with generating set adaptation
(GSAES) and the (1 + 1)-Evolution Strategy with random memorizing (LESRM), n =
20 (left). Results for the minimization of the functions given in the text; the parameter
settings are explained in the text, and average values are shown together with the
standard deviations (right).

Table 2. Results for the (1 + 1)-Evolution Strategy with Local Memorizing for
Rosenbrock's Function

4 Calibration of a Laser Induced Atomic Emission 
Plasma Spectrometer 

Laser induced atomic emission plasma spectrometry is the latest development
in a new generation of innovative measuring technologies. Such devices are employed
in various industrial applications for analytical control of raw materials,
products and processes. The advanced measuring method is based on the
time-resolved spectral analysis of a light emitting laser-induced micro-plasma. The
spectral range is 180 - 750 nm with a resolution of a few pm. An image intensifier
and a camera are installed on the focal plane of the spectrograph. The
measurement of spectra in two dimensions allows the simultaneous analysis of
all relevant spectral lines; 60 elements can be determined simultaneously. The
measurement device works in wavelength ranges above 190 nm without protective
gas. Due to the small diameter of the laser beam and to the fact that no
protective gas is necessary, sample preparation is not required. For the calibration
of such devices it is assumed that there is a relation between the intensity
$I_{ik}$ of emitted light at certain spectral lines and the concentration $c_{ik}$ for element
i in calibration sample k. i.e. Cik — ^ 



The normalization is needed to cope with concentrations. The vectors
Pi are the calibration model parameters, which have to be adapted such that a
calibration error measure based on the known calibration sample concentrations
Cik is minimized. Because of the normalization term, all parameters are mutually
dependent, which makes the problem difficult. Furthermore, it cannot be assumed
that the model functions are continuously differentiable. Figure 2 shows calibration
results for real data from five samples and five elements with a rather small
calibration error.



[Figure: atomic emission plasma spectrometer calibration]

Fig. 2. Intensities vs. known calibration sample concentrations (o) and estimated
concentrations (+) based on the calibration model adaptation

5 Conclusions

Evolutionary Algorithms can be upgraded by a sequential memory which is
accessed randomly to generate promising new search directions. The
computational overhead for exploiting the memory is negligible. With a beam search
along these new directions it has been shown that even for the simplest (1 + 1)-
Evolution Strategy a robust and efficient algorithm can be designed which needs
fewer function evaluations than more sophisticated approaches.

One prerequisite for this approach to work is that the Evolutionary
Algorithm used must be able to generate better solutions. For the considered (1 + 1)-
Evolution Strategy this need not necessarily happen. It cannot work for points
where the level set is such that increasing or decreasing the global step size leaves
the success probability for finding a better solution at a constant, very low value.

Abstract. Floorplanning of VLSI design is one of the key design flows, which
decides the chip size, electrical characteristics, timing constraints, etc., of the final
silicon chip. Many useful floorplan tools are available in the industry. Those
tools provide a very user-friendly interactive environment and also provide
useful information for proceeding with chip design. However, the construction and
decision-making of the floorplan design itself rely on the insight of a human
being. Therefore, the result varies depending on "who did it" and on what initial
condition was given at first. In this paper, the authors propose an application of
Genetic Algorithms to floorplanning for the purpose of providing novice designers
with better initial conditions as a starting point of design work. A floorplan
placement model suitable for Genetic Algorithms is discussed. A computational
experiment is also carried out, and the results suggest practical feasibility.



1. Introduction 

With the rapid increase in the complexity and design difficulty of VLSI,
EDA (Electronic Design Automation) tools have been widely adopted and have
automated many design flows. Such tools have been filling the so-called
design gap between the rate at which silicon complexity increases and the rate
at which design efficiency improves.

In the case of layout design, the placement of primitive cells and the routing
problem have been well automated through long-standing effort [1][2][3]. Many useful layout
tools are already available in the marketplace or as proprietary tools within companies.

Most advanced digital VLSIs include not only sets of primitive cells but also
memory blocks, large macros, sometimes analog blocks, and so on. Therefore,
floorplanning has become the key design flow which decides the final chip size and
performance of the silicon. Several floorplan tools are available in the marketplace,
and advanced semiconductor companies usually have their own floorplan tools.
Almost all such tools provide a very user-friendly interactive environment for
floorplan manipulation and give chip designers useful information for improving the
chip design, such as estimated chip size, routing congestion, connectivity among blocks, etc.

Several attempts to automate floorplan placement have been reported [4] [5] [6].
However, in usual industrial practice, the construction and decision-making of the floorplan
design itself rely completely on the insight of the designer, i.e. a human being.
Therefore, the result of floorplanning depends heavily on "who did it". In general,
an expert designer gives a much better result than a novice or a poorly trained one.

The authors propose an application of Genetic Algorithms to the floorplan design
of VLSI to provide a better initial condition as a starting point of a complicated design
for a novice chip designer.

Genetic Algorithms (referred to as GAs hereafter) are popular methods in the area
of Soft Computing. Several applications of GAs have been reported in the area of
Design Automation [5] [7] [8].

In this paper, we discuss how to apply GAs to floorplan problems and
present computational experiments.



2. Genetic Algorithms (GAs)

Fig. 1. Virtual Life

Fig. 2. Flow Chart of GAs Operation

2.1 Virtual Life

Let a set of virtual lives be introduced as illustrated in Fig. 1. Only chromosomes
are considered, and they correspond to arrays in computer memory that
represent a set of solutions of the problem. The parameters which are the objects of optimization
occupy the loci as genes. The relation or correspondence between those parameters
and the nature of the optimization problem is called the problem of coding. It is a
very important part of the application of GAs. GAs solve optimization problems
by manipulating those chromosomes.



2.2 Basic Operations of GAs 

Fig. 2 shows the flow chart of GAs operation. The basic operations are as follows.

(1) Generation of an initial population

Generate an initial population by using a randomization method.

(2) Evaluation of fitness

Evaluate and compare the virtual lives by a numerical value called fitness.

(3) Preservation of elite

Preserve superior or outstanding individuals so that they are not destroyed by the next
operations, crossover or mutation.

(4) Selection

Select two individuals for the crossover operation. Basically it is preferable to
select superior individuals if possible. However, if the selection is inclined too much to
one side, there is some danger of falling into a local optimum. There are many
kinds of selection methods. We employed roulette wheel selection
because a higher probability of selection is expected for superior individuals while
there remains the possibility of choosing individuals with lower fitness.

(5) Crossover

This operation generates children's chromosomes by
recombining parents' chromosomes. Two children are generated from one pair
of parents. One-point or multi-point crossover is possible. In the case
of one-point crossover, the first child's chromosome is identical to that of the first
parent up to the crossing point and, after the crossing point, identical to
that of the second parent.

Apart from the above, a uniform crossover operation can be considered,
in which each chromosome position is crossed with some probability. In this
paper, the uniform crossover operation is employed, as illustrated in Fig. 3.

(6) Mutation

Mutation is an operation that changes gene(s) randomly with some probability, as
shown in Fig. 4. Suitable mutation gives variety to the population and
makes it possible to obtain a solution which cannot be derived from the
initial population alone.

One generation of GAs is one cycle of evaluation of fitness, selection,
crossover, and mutation. A solution is found by increasing the number of superior
individuals in the population through repeated alternation of generations.
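
The six operations above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the default population size, the gene alphabet and the swap probability of 0.5 in the uniform crossover are all assumptions, and fitness is assumed to be non-negative with larger values being better.

```python
import random

def roulette_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0.0, sum(fitness))
    running = 0.0
    for individual, fit in zip(population, fitness):
        running += fit
        if running > pick:
            return individual
    return population[-1]  # fallback for round-off or all-zero fitness

def uniform_crossover(parent1, parent2, rng, p_swap=0.5):
    """Exchange each gene position independently with probability p_swap."""
    child1, child2 = list(parent1), list(parent2)
    for locus in range(len(child1)):
        if rng.random() < p_swap:
            child1[locus], child2[locus] = child2[locus], child1[locus]
    return child1, child2

def mutate(chromosome, rng, p_mut, gene_values):
    """Replace each gene by a random value with probability p_mut."""
    return [rng.choice(gene_values) if rng.random() < p_mut else gene
            for gene in chromosome]

def run_ga(fitness_fn, n_genes, pop_size=40, generations=60,
           p_mut=0.02, gene_values=(0, 1), seed=0):
    rng = random.Random(seed)
    # (1) random initial population
    population = [[rng.choice(gene_values) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # (2) evaluation of fitness
        fitness = [fitness_fn(ind) for ind in population]
        # (3) preservation of the elite: the best individual is copied unchanged
        best = max(range(pop_size), key=lambda i: fitness[i])
        next_gen = [list(population[best])]
        while len(next_gen) < pop_size:
            # (4) roulette wheel selection of two parents
            p1 = roulette_select(population, fitness, rng)
            p2 = roulette_select(population, fitness, rng)
            # (5) uniform crossover, (6) mutation
            for child in uniform_crossover(p1, p2, rng):
                next_gen.append(mutate(child, rng, p_mut, gene_values))
        population = next_gen[:pop_size]
    return max(population, key=fitness_fn)
```

For example, `run_ga(sum, n_genes=12)` searches for a bit string with as many ones as possible. Because the elite is carried over unchanged, the best fitness in the population never decreases from one generation to the next.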



[Figure: a crossover bit string applied to Parent 1 and Parent 2 yields Child 1 and Child 2]

Fig. 3. Crossover

[Figure: a chromosome and its mutant]

Fig. 4. Mutation



3. Floorplanning 

The purpose of floorplanning is to decide a macroscopic relative placement of blocks on
a chip in advance of detailed placement and routing, with careful evaluation of the final
chip size, wire length, total timing constraints, electrical characteristics and so on.
Well-experienced designers proceed with those operations comprehensively;
however, the time required to complete the design is affected by the quality of the initial
condition. In the case of poorly trained designers, the design efficiency, in terms of both the
quality of the resulting chip and the design time needed, varies heavily depending on what
kind of initial condition was given.

If a tool could give a better initial condition for floorplan placement, it would be
beneficial for both expert and less experienced designers.



4. Application of Genetic Algorithms to Floorplan Placement 

A model of the objective for placement is shown in Fig. 5. Rectangular blocks are
connected by wires, and the aspect ratio and size of each rectangle are given. A
designated chip area (both size and aspect ratio) is also given. The basic
operations are as follows:

(1) Randomly generate an initial population

(2) Evaluate fitness by calculating the performance function

(3) Generate children: Preserve Elite, Select, Crossover, Mutate

(4) Repeat (2) and (3)

(5) After a certain number of generations, choose the best individual

[Figure: netlist with shape of blocks]

Fig. 5. Model of Objective for Placement

[Figure: X-Chromosome, Y-Chromosome, Rotation Chromosome]

Fig. 6. Coding



4.1 Coding 

Correspondence between the chromosome and the target optimization problem is 
called coding. 

First of all, define the X-Y coordinates as shown in Fig. 6. Rotation of a block is
also taken into consideration. Prepare three (3) chromosomes and let them be the X
chromosome, the Y chromosome, and the Rotation chromosome, respectively. In Figure 6,
block #0 with coordinates (X0, Y0) and rotation 0 is expressed as follows:

• 0th gene of X chromosome is X0

• 0th gene of Y chromosome is Y0

• 0th gene of Rotation chromosome is 0

Block #1 at (X1, Y1) with a 90 degree rotation is expressed as:

• 1st gene of X chromosome is X1

• 1st gene of Y chromosome is Y1

• 1st gene of Rotation chromosome is 1

Overlapping and protrusion of blocks are allowed in this coding. We
exclude overlapping and protrusion through the evaluation of the performance
function.
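
The coding can be made concrete with a small decoding routine that turns the three chromosomes into placed rectangles. This is a sketch under stated assumptions: the text does not say which point of a block the gene pair (X, Y) refers to, so the lower-left corner is assumed here, and the names `Rect`, `decode` and `shapes` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    x: float  # assumed: lower-left corner of the block
    y: float
    w: float
    h: float

def decode(x_chrom, y_chrom, rot_chrom, shapes):
    """Turn the X, Y and Rotation chromosomes into placed rectangles.

    shapes[i] is the (width, height) of block i before rotation.
    A rotation gene of 1 means a 90 degree rotation, i.e. width and height swap.
    """
    placed = []
    for x, y, rot, (w, h) in zip(x_chrom, y_chrom, rot_chrom, shapes):
        if rot == 1:
            w, h = h, w
        placed.append(Rect(x, y, w, h))
    return placed
```

Nothing in `decode` prevents two rectangles from overlapping or from leaving the chip area, which matches the statement that such placements are allowed by the coding and penalized only by the performance function.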



4.2 Performance Function 

Based on the behavior of expert designers, we chose a minimal set of items as the
components of the performance function. The virtual wire length $L_i$, the area of overlapping $S_i$,
and the area outside the designated boundary $OV_i$ are taken into consideration for evaluation.
Fig. 7 shows how those items are calculated. $L_i$ is the virtual length of the wires between the centers of
the blocks, weighted by the number of connections.

The performance function is given by

$$E = \alpha \sum_i L_i + \beta \sum_i S_i + \gamma \sum_i OV_i \qquad (4.1)$$

where $\alpha$, $\beta$, and $\gamma$ are tuning parameters.
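
A weighted sum of wire length, overlap area and protrusion area of the kind described in this section can be sketched as follows. Several details are assumptions rather than the authors' definitions: rectangles are given as (x, y, w, h) tuples with (x, y) the lower-left corner, the chip occupies [0, chip_w] x [0, chip_h], wire length is measured as the Manhattan distance between block centers, and the weights `alpha`, `beta`, `gamma` default to 1.

```python
def overlap_area(a, b):
    """Area shared by two axis-aligned rectangles given as (x, y, w, h)."""
    dx = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    dy = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return dx * dy if dx > 0 and dy > 0 else 0.0

def protrusion_area(r, chip_w, chip_h):
    """Area of r lying outside the chip rectangle [0, chip_w] x [0, chip_h]."""
    inside = overlap_area(r, (0.0, 0.0, chip_w, chip_h))
    return r[2] * r[3] - inside

def center(r):
    return (r[0] + r[2] / 2.0, r[1] + r[3] / 2.0)

def performance(rects, nets, chip_w, chip_h, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of wire length, overlap and protrusion; lower is better.

    nets maps a pair of block indices (i, j) to the number of connections.
    """
    wire = 0.0
    for (i, j), n_conn in nets.items():
        (xi, yi), (xj, yj) = center(rects[i]), center(rects[j])
        # Manhattan distance between centers: an assumption of this sketch
        wire += n_conn * (abs(xi - xj) + abs(yi - yj))
    overlap = sum(overlap_area(rects[i], rects[j])
                  for i in range(len(rects))
                  for j in range(i + 1, len(rects)))
    protrusion = sum(protrusion_area(r, chip_w, chip_h) for r in rects)
    return alpha * wire + beta * overlap + gamma * protrusion
```

With two 2 x 2 blocks at (0, 0) and (1, 1) joined by three connections on a 3 x 3 chip, the wire term is 6, the overlap term is 1 and nothing protrudes, so the value is 7. Shrinking the chip to 2 x 2 adds a protrusion of 3.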



4.3 Mutation 

Mutation is implemented by moving randomly selected blocks by a certain
distance (the distance is also decided randomly), as shown in Fig. 8.

[Figure: evaluation items; surviving labels: designated area, OV_i, (C) Protrusion]

Fig. 7. Evaluation Items of Performance Function

Fig. 8. Mutation Method

5. Computational Experiment

A computational experiment has been carried out. We applied GAs to a test
circuit consisting of 16 blocks, with 1000, 3000, and 10000 individuals as the initial
population, respectively. The number of generations was set to 300. A
mutation probability of 0.5% was employed, based on experimental results.

Fig. 9 shows the value E of the performance function defined by (4.1) against the
number of generations. In the case of 1000 individuals the performance function was
not improved after 80 generations. It is considered that convergence to a local optimum
occurred. In the cases of 3000 individuals and 10000 individuals, E converges at
about the 160th generation, and both cases gave similar results. An initial population of 3000
individuals seems sufficient for this circuit.

Some placement results obtained in this trial are shown in Fig. 10, 11, 12, and
13. Fig. 10, Fig. 11 and Fig. 12 show the best individual in the initial population, at
the 50th generation, and at the 100th generation, respectively. Fig. 13 shows the final result
after 300 generations.

Interviews with experienced designers yielded subjectively positive opinions of the
results as initial conditions for a starting point of
floorplanning.



6. Conclusion 

An application of Genetic Algorithms to VLSI floorplanning was proposed.

A computational experiment on a test case consisting of 16 blocks shows a practical
result as an initial condition for novice designers.

Also, for a case of the size stated above, the result showed that 3000 initial
individuals were enough but 1000 were not enough for the application of GAs.




Fig. 9. Performance Function of the Best Individual in Each Generation

Fig. 10. 1st Generation

Fig. 11. 50th Generation

Fig. 12. 100th Generation

Fig. 13. 300th Generation

Abstract. We apply the idea of learning with delayed rewards to improve
the performance of the Ant System. We discuss different mechanisms
of delayed rewards in the Ant Algorithm (AA). The AA for the JSP
was first applied in its classical form by A. Colorni and M. Dorigo. We adapt
the idea of an evolution of the algorithm itself using the methods of the
learning process. We accentuate the co-operation and stigmergy effect in
this algorithm. We propose optimal values of the parameters used in
this version of the AA, derived from our experiments.



1 Introduction 

The optimization algorithm that we propose in this paper was inspired by previous
works on ant systems and, in general, by the notion of stigmergy. The
term stigmergy was first introduced by P. P. Grassé [GRAS59]
to describe the indirect communication taking place among individuals through
modifications induced in their environment.

The first application of the AA to the JSP was described in [COLO94]. In that
paper the authors showed how a new heuristic called ant system can be successfully
applied to find good solutions of this problem. The effectiveness of the algorithm
is due to the co-operation among the agents (a positive feedback or autocatalytic
process). It was shown in the experiments with MT10, ORB1, ORB4 and LA21.
Our intention is to present a new concept of learning among the ants. We propose
different versions of reward functions.

2 Background 

This paper introduces a family of ant algorithms where distributed agents (arti- 
ficial ants) with some kinds of knowledge solve the JSP problem. The ant system 
we are going to apply simulates the real world and features of this system are 
presented in [BORY97]. 

The Ant System utilizes a weighted graph (N, E) with the set of nodes
(N) and the set of edges between nodes weighted by a Utility measure (E).
Each edge (i, j) also has a Reward measure $\Delta\tau(i, j)$, called pheromone, which
is updated by artificial ants. Different ways of increasing the Reward measure
(pheromone) yield different versions of the AA: AntDensity, AntQuantity and
AntCycle [COLO92].

The AntCycle algorithm mentioned above (the best of the AA versions) may be presented
in the notation of Evolutionary Computing [MICH96] in the way described in
[BORY97]. Below we discuss the node transition rule and the pheromone
reinforcement rules. The node transition rule used in this version of the Ant System,
called the random proportional rule, is given by Eq. (1), which describes the
probability with which agent k in node i chooses the next node j to move to:



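$$P_t^k(i,j) = \begin{cases} \dfrac{[Util_t(i,j)]^{\delta} \cdot [He(i,j)]^{\beta}}{\sum_{l \in allowed} [Util_t(i,l)]^{\delta} \cdot [He(i,l)]^{\beta}} & \text{if } j \in allowed \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where allowed, $\delta$ and $\beta$ are described in [COLO92], $Util(i,j)$ is a utility measure, and
$He(i,j)$ is a heuristic that helps to solve the JSP (in our experiments
the rule LRT).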
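
A random proportional rule of this kind can be sketched as below. The sketch assumes the usual Ant System form, in which the weight of a candidate node is the utility raised to one exponent times the heuristic value raised to another, normalised over the allowed nodes. The names `util`, `heur`, `delta` and `beta`, and the dictionary-based data layout, are assumptions made for illustration.

```python
import random

def transition_probabilities(util, heur, allowed, delta=1.0, beta=1.0):
    """Probability of moving from the current node to each candidate node.

    util[j] and heur[j] are the utility (pheromone) and heuristic values of
    the edge from the current node to j. Nodes outside `allowed` get 0.
    Assumes at least one allowed node has a positive weight.
    """
    weights = {j: (util[j] ** delta) * (heur[j] ** beta) for j in allowed}
    total = sum(weights.values())
    return {j: (weights[j] / total if j in weights else 0.0) for j in util}

def choose_next(util, heur, allowed, rng, delta=1.0, beta=1.0):
    """Sample the next node according to the random proportional rule."""
    probs = transition_probabilities(util, heur, allowed, delta, beta)
    nodes = list(allowed)
    return rng.choices(nodes, weights=[probs[j] for j in nodes], k=1)[0]
```

Raising `beta` relative to `delta` makes the agents follow the heuristic more closely, while raising `delta` makes them follow the accumulated pheromone.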

The pheromone reinforcement rule is intended to allocate a greater amount 
of pheromone to a shorter schedule. This rule is similar to the reinforcement 
learning scheme [SING97], [GAMB95], in which better solutions get a higher 
reinforcement. 

The pheromone trail laid on the edges plays the role of a delayed reward and of a
memory for the agents, each of which will probably choose such an edge. This allows an indirect
form of communication and co-operation between the agents.

The Utility measure is a sum of present rewards and long-term rewards
received during the iteration time, according to the formula:

$$Util_t(i,j) = Rew_t(i,j) + \gamma \cdot \max \sum_{n=1}^{N-1} Rew_{t+n}(i,j) \qquad (2)$$

where $Rew_t(i,j) = \Delta\tau_t(i,j)$ is the delayed reward, $\Delta\tau_{t=0}(i,j) = \tau_0$, n is the number
of steps of the algorithm and $\gamma$ is a discount factor (as in Reinforcement
Learning [SING97]) with $0 \le \gamma \le 1$. In our experiments we tested $\gamma = 1$. l is the
next node and k is the number of ants. We must emphasize that we lay the
pheromone trail on the appropriate edges after the cycle of the algorithm is complete.

The Reward measure, connected with the increase of the pheromone trail (equal
to the global reinforcement rule), is computed according to the formula:

$$\Delta\tau(i,j) = \begin{cases} \dfrac{Q}{L_{gl}} & \text{if } (i,j) \in \text{global-best allocation} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

where $L_{gl}$ is the length of the globally best solution from the beginning of
the trail and Q is a parameter (as in Q-learning [DORI94]).
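
The global reinforcement rule can be illustrated with a short sketch. It assumes that the reward on an edge of the globally best allocation is the parameter Q divided by the length of that allocation, as in the AntCycle variant of the Ant System; the function and argument names are hypothetical.

```python
def global_best_reward(edges, best_allocation, best_length, Q):
    """Reward for every edge: Q / best_length on edges of the globally best
    allocation, 0 on all other edges (assumed AntCycle-style rule)."""
    best_edges = set(best_allocation)
    return {edge: (Q / best_length if edge in best_edges else 0.0)
            for edge in edges}
```

A shorter best schedule gives a larger reward per edge, which is the sense in which a shorter schedule is allocated a greater amount of pheromone.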

Finally, in the ant system, the pheromone level is updated by the formula:

$$Util_{t+1}(i,j) = (1 - \alpha) * \ldots \qquad (4)$$

where l is the next node, k is the number of ants and $\alpha$ is the learning factor,
equal to a pheromone evaporation parameter.

We also tested another type of updating rule, called local-best, which uses
$L_{loc}$ (the length of the best allocation in the current iteration of the trail) instead
of the length of the globally best solution. Under the local-best rule, the edges which receive
reinforcement are those belonging to the best allocation found in the current
iteration. Experiments have shown that the difference between the two schemes
is minimal, with a slight preference for the global-best, which is therefore
applied in the following experiments.

An application of the AA (in its classical form) to the JSP was described
in detail in [COLO94].

3 Examining the efficiency of the AA for the JSP.
Conclusions

First, the classical AA (AntCycle) was tested. The 8 x 10 problem was tested
using different sets of parameters [COLO92]. Analyzing the results, we can
conclude that the classical AA found the best results (the shortest processing
time) for the greatest value of $\beta$ and for small values of $\delta$. The influence
of the parameter $\rho$ (in the analyzed range) was unimportant.

Now we present the results of tests for the 10 x 10 JSP with the sets of
parameters described in the previous section. Parameters Q and Q1 had the values 700
and 1400, respectively. All the results were averaged over 10 experiments of 1000
iterations. The investigations were carried out for the following delayed rewards
[BORY97]:



name                        denotation
global leader               global I
reinforced global leader    global II
elite                       global IV
local leader                local I
reinforced local leader     local II


For the set of parameters of Fig. 1, the advantage of the rules based on the
global reinforcement may be observed. The results were about 4% better than
for the classic algorithm. This was a result of decreasing the relative importance of the
heuristics, $\beta$ (from 20 to 5). Using this set of parameters, the shortest processing
time for the 10 x 10 JSP was obtained (646 time units for global I).

From the described results we can derive the following conclusions: the
reinforcement rules we introduced increased the efficiency of the AA regardless of
the size of the analyzed JSP; suitably chosen values of the parameters ($\delta$, $\beta$, $\rho$, Q
and Q1) allow the AA with any delayed rewards based on global
information (global reinforcement rule) to give results about 5% better than the
classic algorithm; global IV seems to be the best rule; the values assigned to
the parameters Q and Q1 are as important as the value of the parameter $\delta$.

We have also examined how the number of agents (ants) influences the
behavior of the generative policies for the 10 x 10 JSP. The results for 100 agents
were: 625 time units for the shortest processing time and 645 time units for the
average of the best results.

Fig. 1. Results for $\delta = 5$, $\beta = 5$ and $\rho = 0.9$.

Artificial ants co-operate by updating information via the pheromone trail,
which helps them to find the optimal solution very quickly. The experiments show
the very important role of the heuristics that the agents use, and of how these heuristics
influence the transition rule. Thus, in further experiments with the JSP,
different types of heuristics will be tested.

FUZZY EXTENSION OF ROUGH SETS THEORY
(EXTENDED ABSTRACT)

GIANPIERO CATTANEO 



1. Abstract Rough Approximation Spaces 



In this paper we consider many approximation spaces for rough theories, all of which are
concrete realizations of an abstract structure stronger than the one introduced in [2] and defined
in the following way:

$$\mathfrak{A} := \langle \Sigma, \mathcal{D}(\Sigma), \le, 0, 1 \rangle$$

where:

1. $\langle \Sigma, \wedge, \vee, 0, 1 \rangle$ is a complete lattice with respect to the partial order relation $\le$, bounded
by the least element 0 ($\forall x \in \Sigma$, $0 \le x$) and the greatest element 1 ($\forall x \in \Sigma$, $x \le 1$); elements
from $\Sigma$ are interpreted as concepts, data, etc., and are said to be approximable elements;

2. $\mathcal{D}(\Sigma)$ is a sublattice of $\Sigma$ whose elements are called definable;

and satisfying the following axioms:

(Ax1): For any approximable element $x \in \Sigma$, there exists (at least) one element $i(x)$ such
that:

(1.1a) $i(x) \in \mathcal{D}(\Sigma)$

(1.1b) $i(x) \le x$

(1.1c) $\forall \alpha \in \mathcal{D}(\Sigma)$, $(\alpha \le x \Rightarrow \alpha \le i(x))$

(Ax2): For any approximable element $x \in \Sigma$, there exists (at least) one element $o(x)$ such
that:

(1.2a) $o(x) \in \mathcal{D}(\Sigma)$

(1.2b) $x \le o(x)$

(1.2c) $\forall \gamma \in \mathcal{D}(\Sigma)$, $(x \le \gamma \Rightarrow o(x) \le \gamma)$

i.e., $i(x)$ [resp., $o(x)$] is the best approximation of the "vague" element x from the bottom
[resp., top] by definable elements.

For any approximable element $x \in \Sigma$, the inner and the outer definable elements $i(x), o(x) \in \mathcal{D}(\Sigma)$,
whose existence is assured by (Ax1) and (Ax2), are unique. Thus it is possible to introduce
the inner approximation mapping $i : \Sigma \mapsto \mathcal{D}(\Sigma)$ and the outer approximation mapping $o : \Sigma \mapsto \mathcal{D}(\Sigma)$,
defined for every $x \in \Sigma$ respectively as

(1.3) $i(x) := \max\{\alpha \in \mathcal{D}(\Sigma) : \alpha \le x\}$, $o(x) := \min\{\gamma \in \mathcal{D}(\Sigma) : x \le \gamma\}$

The rough approximation of any approximable element $x \in \Sigma$ is then the ordered inner-outer
pair

(1.4) $r(x) := (i(x), o(x))$ [with $i(x) \le x \le o(x)$]

which is the image of the element x under the rough approximation mapping $r : \Sigma \mapsto \mathcal{D}(\Sigma) \times \mathcal{D}(\Sigma)$
pictured by the following diagram:

[Diagram of the mappings i, o and r between $\Sigma$ and $\mathcal{D}(\Sigma)$]

L. Polkowski and A. Skowron (Eds.): RSCTC'98, LNAI 1424, pp. 275-282, 1998.

Following [2], an element x of $\Sigma$ is said to be crisp (also exact) if and only if its inner and outer
approximations coincide: $i(x) = o(x)$; equivalently, iff its rough approximation is the trivial one
$r(x) = (x, x)$. Owing to (1.3) this happens iff x is definable; therefore, $\mathcal{D}(\Sigma)$ is the collection of
all crisp elements.

2. The Orthodox Pawlak Approach to Rough Set Theory 

The usual approach to rough set theory as introduced by Pawlak [10, 11] is formally based
on a pair $(X, \pi(X))$ consisting of a nonempty set X, the universe [with corresponding power set
$\mathcal{P}(X)$, the collection of approximable sets], and a partition $\pi(X) := \{M_i \in \mathcal{P}(X) : i \in I\}$ of X
whose elements are the elementary sets. The partition $\pi(X)$ can be characterized by the induced
equivalence relation $\mathcal{R} \subseteq X \times X$, defined as $(x, y) \in \mathcal{R}$ iff $\exists M_j \in \pi(X) : x, y \in M_j$; in this case
we say that x, y are indistinguishable with respect to $\mathcal{R}$ and the equivalence relation $\mathcal{R}$ is called
an indistinguishability relation.

A definable set (or simple proposition) is any subset of X obtained as the set theoretic union
of elementary subsets: $M_J = \cup\{M_j \in \pi(X) : j \in J \subseteq I\}$; the collection of all such definable sets
plus the empty set will be denoted by $\Pi(X)$ and it turns out to be a Boolean algebra
(orthocomplemented atomic distributive complete lattice) $\langle \Pi(X), \cap, \cup, {}^c, \emptyset, X \rangle$ with respect to set theoretic
intersection, union, and complementation.

To any subset of the universe $H \in \mathcal{P}(X)$ one can associate the lower (also inner) approximation

(2.1) $i(H) := \cup\{M_J \in \Pi(X) : M_J \subseteq H\}$

and the upper (also outer) approximation

(2.2a) $o(H) := \cap\{M_J \in \Pi(X) : H \subseteq M_J\}$

(2.2b) $= \cup\{M_j \in \pi(X) : H \cap M_j \ne \emptyset\}$

In this way, according to Section 1, we can construct the concrete approximation space

(2.3) $\mathfrak{A} = \langle \mathcal{P}(X), \Pi(X), i, o \rangle$

consisting of: (1) the Boolean (complete) lattice $\mathcal{P}(X)$ of all approximable subsets of the universe
X; (2) the Boolean (complete) lattice $\Pi(X)$ of all definable subsets of X; (3) the inner
approximation map $i : \mathcal{P}(X) \mapsto \Pi(X)$ associating with any approximable set H its inner approximation
$i(H)$ defined by (2.1); (4) the outer approximation map $o : \mathcal{P}(X) \mapsto \Pi(X)$ associating with any
approximable set H its outer approximation $o(H)$ defined by (2.2).

The rough approximation of an approximable set H is the pair

(2.4) $r(H) := (i(H), o(H))$, with $i(H) \subseteq H \subseteq o(H)$.
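
The lower and upper approximations can be computed directly from a partition, as the following sketch shows. It works with the elementary sets rather than with all definable sets, which gives the same result because every definable set is a union of elementary sets. The function names are illustrative, and the partition is assumed to be given as a list of Python sets.

```python
def lower_approximation(partition, H):
    """Union of the elementary sets that are contained in H."""
    H = set(H)
    return set().union(*(M for M in partition if set(M) <= H))

def upper_approximation(partition, H):
    """Union of the elementary sets that have a nonempty intersection with H."""
    H = set(H)
    return set().union(*(M for M in partition if set(M) & H))

def rough_approximation(partition, H):
    """The pair (inner approximation, outer approximation) of H."""
    return lower_approximation(partition, H), upper_approximation(partition, H)
```

For the partition {{1, 2}, {3, 4}} and H = {2, 3, 4}, the inner approximation is {3, 4} and the outer approximation is the whole universe {1, 2, 3, 4}, so H is not crisp with respect to this partition.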

3. Approximation Spaces Induced from a Knowledge Representation System 

Knowledge representation systems are generally introduced as models for concepts and relations
among them; a number of such models have been proposed in the literature, mainly in connection
with Artificial Intelligence. In this paper we are concerned with knowledge representation systems
as they are understood in Rough Set Theory: "The main theoretical concept in this [...] approach
is the notion of knowledge representation system (KR-system) introduced by Pawlak in [9] under
the name of information system. A KR-system is a formalism for representing knowledge about
some objects in terms of attributes (e.g., color) and values of attributes (e.g., green)." [13]. To be
precise, a KR-system is a structure

$$\mathcal{KR} := \langle X, Att(X), val(X), F \rangle$$

where X is a nonempty set of objects (situations, entities, states); Att(X) is a nonempty set of
attributes valuable on objects of the set X; val(X) is the set of possible values which can be
assumed in any observation on objects from X; F is a mapping $F : X \times Att(X) \mapsto val(X)$ called
information mapping.

For any fixed attribute $\alpha \in Att(X)$ we denote by $val(\alpha) := \{\lambda \in val(X) \mid \exists x \in X : F(x, \alpha) = \lambda\}$
the set of all possible values of attribute $\alpha$; the pair $(\alpha, \lambda)$ is an elementary question describing
the sentence: "A test of the attribute $\alpha$ yields the value $\lambda$." Moreover, the chosen attribute
determines a map $f_\alpha : X \mapsto val(\alpha)$, $f_\alpha(x) := F(x, \alpha)$, which, in analogy with physics, can be
viewed as an observable magnitude defined on the phase space X and assigning to every state
$x \in X$ the value $f_\alpha(x)$ assumed by the observable $\alpha$ when the system is in the state x.

With any elementary question $(\alpha, \lambda)$ we can associate the subset of states

(3.1) $M_\alpha(\lambda) := f_\alpha^{-1}(\lambda) = \{x \in X : f_\alpha(x) = \lambda\}$

consisting of all states in which a measure of the observable $\alpha$ produces the result $\lambda$. Trivially, the
collection of all such subsets of X,

(3.2) $\pi_\alpha(X) := \{M_\alpha(\lambda) \in \mathcal{P}(X) : \lambda \in val(\alpha)\}$,

is a partition of X, called the $\alpha$-partition. The indistinguishability relation induced from this
$\alpha$-partition is $(x, y) \in \mathcal{R}_\alpha$ iff $f_\alpha(x) = f_\alpha(y)$.

Example 3.1. Consider the KR-system:

$X = \{1, 2, 3, 4\}$, $Att(X) = \{\alpha_0, \alpha_1, \alpha_2\}$, $val(X) = \{Y, R, G, M, L, S, A, T\}$

The observables describing the attributes under consideration are given in the following table:

$x \in X$    $f_{\alpha_0}(x)$    $f_{\alpha_1}(x)$    $f_{\alpha_2}(x)$
1            A                    G                    S
2            A                    R                    S
3            T                    Y                    M
4            T                    Y                    L

We have three $\alpha$-partitions, one for each attribute:

$\pi_{\alpha_0}(X) = \{\{1, 2\}, \{3, 4\}\}$, $\pi_{\alpha_1}(X) = \{\{1\}, \{2\}, \{3, 4\}\}$, $\pi_{\alpha_2}(X) = \{\{1, 2\}, \{3\}, \{4\}\}$.

The following are rough approximations of the same subset of states:

$r_{\alpha_0}(\{2, 3, 4\}) = (\{3, 4\}, \{1, 2, 3, 4\})$

$r_{\alpha_1}(\{2, 3, 4\}) = (\{2, 3, 4\}, \{2, 3, 4\})$

$r_{\alpha_2}(\{2, 3, 4\}) = (\{3, 4\}, \{1, 2, 3, 4\})$.

Notice that the $\alpha_1$-approximation of {2, 3, 4} is crisp.

The meaning of the attributes described in this example could be the following: $\alpha_0$ is the shape
of the object (A as arched and T as thin), $\alpha_1$ is the color (Y as yellow, G as green and R as red),
$\alpha_2$ is the dimension (S as small, M as medium and L as large).
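
Example 3.1 can be checked mechanically. The sketch below encodes the information table as a nested dictionary and derives, for each attribute, the induced partition and the rough approximation of {2, 3, 4}. The attribute keys "a0", "a1", "a2" and the function names are illustrative choices, not notation from the paper.

```python
from collections import defaultdict

# Information table of Example 3.1: F[x][a] is the value of attribute a on object x.
F = {
    1: {"a0": "A", "a1": "G", "a2": "S"},
    2: {"a0": "A", "a1": "R", "a2": "S"},
    3: {"a0": "T", "a1": "Y", "a2": "M"},
    4: {"a0": "T", "a1": "Y", "a2": "L"},
}

def attribute_partition(table, attribute):
    """Group the objects by the value the attribute takes on them."""
    classes = defaultdict(set)
    for x, row in table.items():
        classes[row[attribute]].add(x)
    return list(classes.values())

def rough(partition, H):
    """Inner and outer approximation of H with respect to the partition."""
    H = set(H)
    lower = set().union(*(M for M in partition if M <= H))
    upper = set().union(*(M for M in partition if M & H))
    return lower, upper
```

Calling `rough(attribute_partition(F, "a1"), {2, 3, 4})` returns the same set twice, which is the crispness observed in the example, whereas the other two attributes give an inner approximation of {3, 4} and an outer approximation equal to the whole universe.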

Coming back to the general case, consider the measurable space of (micro) states $(X, \mathcal{P}(X))$;
moreover, for any attribute $\alpha$, consider the pair $(val(\alpha), \mathcal{P}(val(\alpha)))$, consisting of the $\alpha$-value set
and of its power set, as the measurable space induced by $\alpha$. Then, the $\alpha$-definable sets are of the
form:

(3.3a) $\forall \Delta \in \mathcal{P}(val(\alpha))$, $M_\alpha(\Delta) := \cup\{M_\alpha(\lambda) \in \pi_\alpha(X) : \lambda \in \Delta\}$

(3.3b) $= \{x \in X : f_\alpha(x) \in \Delta\} = f_\alpha^{-1}(\Delta)$

In this way we can also describe the attribute $\alpha$ as the random variable $f_\alpha^{-1} : \mathcal{P}(val(\alpha)) \mapsto \mathcal{P}(X)$
satisfying the following conditions:

(RV1) $f_\alpha^{-1}(val(\alpha)) = X$;

(RV2) $\Delta_1 \cap \Delta_2 = \emptyset$ implies $f_\alpha^{-1}(\Delta_1) \cap f_\alpha^{-1}(\Delta_2) = \emptyset$;

(RV3) for any family $\{\Delta_n\}$ of pairwise disjoint subsets of $val(\alpha)$ we have that
$f_\alpha^{-1}(\cup \Delta_n) = \cup f_\alpha^{-1}(\Delta_n)$.

Note that on the basis of conditions (RV1) - (RV3) it is possible to infer the following further
properties of a random variable: (a) $f_\alpha^{-1}(\emptyset) = \emptyset$; (b) $f_\alpha^{-1}(val(\alpha) \setminus \Delta) = X \setminus f_\alpha^{-1}(\Delta)$; (c) for any
family $\{\Delta_n\}$ of pairwise disjoint subsets of $val(\alpha)$: $f_\alpha^{-1}(\cap \Delta_n) = \cap f_\alpha^{-1}(\Delta_n)$.

4. Realization of Pawlak Rough Approximation Spaces by Sharp Identity Resolutions

The power set $\mathcal{P}(X)$ of the universe X and the family of all characteristic functionals ({0, 1}-valued
functions on X) are in a one-to-one correspondence with respect to the map associating to
any subset A of X the functional $\chi_A : X \mapsto \{0, 1\}$ defined as $\chi_A(x) := 1$ if $x \in A$, 0 otherwise.

In the case of a universe X which is a compact metric space with respect to some distance
function $d : X \times X \mapsto \mathbb{R}_+$ defined on it, the set $\{0, 1\}^X$ of all {0, 1}-valued functions on X can
be considered as embedded in the C* algebra $\langle B(\mathbb{C}^X), +, \cdot, *, \|\cdot\|, 0, 1 \rangle$ of all complex-valued
bounded functions defined on X, equipped with the usual pointwise sum and product operations,
the adjoint involution operation * which yields the complex conjugation of functions, and the
norm of uniform convergence $\|f\| := \sup\{|f(x)| : x \in X\}$. In particular {0, 1}-valued functions
$\chi_A$ satisfy the property of being projections of the C* algebra, that is, they are (1) self-adjoint
(i.e., real valued); (2) bounded; (3) idempotent [$(\chi_A)^2 = \chi_A$].

In what follows, extending this terminology to the case of a general universe X, {0, 1}-valued
functions on X are said to be projections.

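The set $\{0,1\}^X$ of all characteristic functionals on $X$ determines an atomic Boolean (complete)
lattice $(\{0,1\}^X, \wedge, \vee, \neg, \mathbf{0}, \mathbf{1})$, where $\mathbf{0}$ and $\mathbf{1}$ are the characteristic functionals of the empty set
[$\forall x \in X$, $\mathbf{0}(x) = 0$] and of the whole universe [$\forall x \in X$, $\mathbf{1}(x) = 1$], respectively. The operations $\wedge$,
$\vee$ and $\neg$ are defined $\forall x \in X$ by the laws:

(4.1a) $(\chi_A \wedge \chi_B)(x) = \min\{\chi_A(x), \chi_B(x)\}$

(4.1b) $\qquad = \max\{0, \chi_A(x) + \chi_B(x) - 1\}$

(4.1c) $\qquad = \chi_A(x) \cdot \chi_B(x)$

(4.2a) $(\chi_A \vee \chi_B)(x) = \max\{\chi_A(x), \chi_B(x)\}$

(4.2b) $\qquad = \min\{1, \chi_A(x) + \chi_B(x)\}$

(4.2c) $\qquad = \chi_A(x) + \chi_B(x) - \chi_A(x) \cdot \chi_B(x)$

(4.3a) $(\neg\chi_A)(x) = (1 - \chi_A)(x)$

(4.3b) $\qquad = \begin{cases} 1, & \text{if } \chi_A(x) = 0 \\ 0, & \text{if } \chi_A(x) = 1 \end{cases}$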


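The mapping $\chi : \mathcal{P}(X) \to \{0,1\}^X$, $A \mapsto \chi_A$, is clearly a Boolean lattice isomorphism identifying
$\mathcal{P}(X)$ and $\{0,1\}^X$, since

(4.4a) $\chi_{A \cap B} = \chi_A \wedge \chi_B$

(4.4b) $\chi_{A \cup B} = \chi_A \vee \chi_B$

(4.4c) $\chi_{A^c} = \neg\chi_A$

This isomorphism also preserves the partial ordering relations since $A \subseteq B$ if and only if $\forall x \in X$,
$\chi_A(x) \le \chi_B(x)$.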
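The identities (4.1)-(4.4) can be checked mechanically on a small case. The following is a minimal sketch in Python; the universe `X` and the subsets `A`, `B` are hypothetical values chosen only for illustration, and characteristic functionals are represented as dictionaries.

```python
from itertools import product

def chi(subset, universe):
    """Characteristic functional of `subset`, stored as a dict x -> 0/1."""
    return {x: 1 if x in subset else 0 for x in universe}

def meet_forms(a, b):
    """The three expressions (4.1a)-(4.1c) evaluated at values a, b in {0, 1}."""
    return (min(a, b), max(0, a + b - 1), a * b)

def join_forms(a, b):
    """The three expressions (4.2a)-(4.2c) evaluated at values a, b in {0, 1}."""
    return (max(a, b), min(1, a + b), a + b - a * b)

# On {0, 1} the three forms of each operation coincide.
forms_agree = all(
    len(set(meet_forms(a, b))) == 1 and len(set(join_forms(a, b))) == 1
    for a, b in product((0, 1), repeat=2)
)

# Hypothetical universe and subsets (not taken from the paper).
X = {1, 2, 3, 4}
A, B = {1, 2}, {2, 3}
chi_A, chi_B = chi(A, X), chi(B, X)

meet = {x: min(chi_A[x], chi_B[x]) for x in X}   # (4.1a)
join = {x: max(chi_A[x], chi_B[x]) for x in X}   # (4.2a)
neg_A = {x: 1 - chi_A[x] for x in X}             # (4.3a)
```

Comparing `meet`, `join` and `neg_A` with `chi(A & B, X)`, `chi(A | B, X)` and `chi(X - A, X)` is exactly the content of (4.4a)-(4.4c) for this example.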

Consider now a rough approximation space $\mathfrak{A}$ determined on a universe $X$ of finite cardinality
by the (finite) partition $\pi(X) = \{M_i : i \in I\}$ of elementary sets (the extension to a generic universe
with a countable partition is straightforward). The corresponding collection of characteristic
functionals $\{P(i) := \chi_{M_i} : i \in I\}$, borrowing the terminology from Functional Analysis, is a sharp
identity resolution $P : I \to \{0,1\}^X$, $i \mapsto P(i) = \chi_{M_i}$, in the sense that it satisfies the following
properties:

(sir-1) all $P(i)$ are nonzero projections [real valued, bounded, and idempotent functions];

(sir-2) $\forall i \ne j$, $P(i) \le \neg P(j)$, equivalently $P(i) \cdot P(j) = \mathbf{0}$ [orthogonality condition];

(sir-3) $\sum_{i \in I} P(i) = \mathbf{1}$ [identity resolution].

The corresponding Boolean lattice ($\sigma$-algebra) $\Pi(X) = \{M_J : J \in \mathcal{P}(I)\}$ of definable sets can
be represented by the family of characteristic functions (projections) $\{P(J) := \chi_{M_J} : J \in \mathcal{P}(I)\}$
which generate a Projection Valued (PV) measure $P : \mathcal{P}(I) \to \{0,1\}^X$, $J \mapsto P(J) = \chi_{M_J}$, in the
sense that it satisfies the following properties:

(PV1) $P(I) = \mathbf{1}$;

(PV2) $J_1 \cap J_2 = \emptyset$ implies $P(J_1) \le \neg P(J_2)$;

(PV3) for any family $\{J_n\}$ of pairwise disjoint subsets of $I$ we have that

$P(\bigcup J_n) = \sum P(J_n)$.

In the case of a KR-system of finite cardinality and with real set of values $val(X) \subseteq \mathbb{R}$, the
$a$-partition $\pi_a(X) = \{M_a(\lambda_1), M_a(\lambda_2), \ldots, M_a(\lambda_{n(a)})\}$, determined by an attribute $a$ with finite
set of possible real values $val(a) = \{\lambda_1, \lambda_2, \ldots, \lambda_{n(a)}\}$, once we introduce for simplicity the notation

(4.5) $P_a(\lambda) := \chi_{M_a(\lambda)}$,

gives rise not only to the identity resolution $\{P_a(\lambda_1), P_a(\lambda_2), \ldots, P_a(\lambda_{n(a)})\}$ but also to a spectral
identity resolution of the real observable $f_a$. This means that the following condition is also
satisfied:

(sir-4) $f_a = \sum_{i=1}^{n(a)} \lambda_i \, P_a(\lambda_i)$.

Example 4.1. If in Example 3.1 we represent the values Y, R,G, L, S, A,T by distinct real 
numbers, then obviously 

fao — A • X{2,3} + B ' X{3,4}? fai — G ' X{1} + P ' X{2} + ^ ' X{3,4}? 
foc2—B‘ X{1,2} + ^ • X{3) + L • X{4} • 

Note that in the present context of a crisp representation of P(X), if for the sake of simplicity 
we introduce the notation 

(4.6) $P_a(\Delta) := \chi_{M_a(\Delta)} = \chi_{f_a^{-1}(\Delta)}$,

in analogy with the sharp approach to axiomatic foundations of quantum mechanics, the conditions
(RV1)-(RV3) above assume the form of a Projection Valued (PV) measure:

(PV1) $P_a(val(a)) = \mathbf{1}$;

(PV2) $\Delta_1 \cap \Delta_2 = \emptyset$ implies $P_a(\Delta_1) \le \neg P_a(\Delta_2)$;

(PV3) for any family $\{\Delta_n\}$ of pairwise disjoint subsets of $val(a)$ we have that

$P_a(\bigcup \Delta_n) = \sum P_a(\Delta_n)$.



5. Fuzzy Rough Approximation Spaces 

Recall that the notion of characteristic functional on the universe $X$ can be generalized to the
notion of fuzzy set defined as a $[0,1]$-valued function on $X$, $f : X \to [0,1]$. The most interesting
algebraic structure involving fuzzy sets is the BZMV algebra of De Morgan type (see [4, 3]):

(5.1) $([0,1]^X, \oplus, \odot, \neg, \sim, \mathbf{0}, \mathbf{1})$

where the operations are defined as follows:

(5.2a) $(f \oplus g)(x) := \min\{1, f(x) + g(x)\}$

(5.2b) $(f \odot g)(x) := \max\{0, f(x) + g(x) - 1\}$

(5.2c) $\neg f(x) := (1 - f)(x)$

(5.2d) $\sim f(x) := \begin{cases} 1, & \text{if } f(x) = 0 \\ 0, & \text{if } f(x) \ne 0 \end{cases}$

Following Chang [7, 8], the algebraic substructure $([0,1]^X, \oplus, \odot, \neg, \mathbf{0}, \mathbf{1})$ is a standard MV
algebra. Recall that in any MV algebra the following operations can also be introduced:

(5.3a) $(f \wedge g)(x) := [(f \oplus \neg g) \odot g](x)$

(5.3b) $(f \vee g)(x) := [(f \odot \neg g) \oplus g](x)$

(5.3c) $(f \to_L g)(x) := (\neg f \oplus g)(x)$

The first two new operations are the binary lattice operations of meet and join generating the partial
ordering

(5.4) $f \le g$ iff $f \wedge g = f$ iff $f \to_L g = \mathbf{1}$

Trivially, on fuzzy sets the above operations assume the forms: (1) $(f \wedge g)(x) = \min\{f(x), g(x)\}$
and (2) $(f \vee g)(x) = \max\{f(x), g(x)\}$; moreover, the partial ordering (5.4) turns out to be the
pointwise ordering on real valued functions: $f \le g$ iff $\forall x \in X$, $f(x) \le g(x)$. The third bi-
nary operation corresponds to the Lukasiewicz implication connective for many-valued logics [12]:
$(f \to_L g)(x) = \min\{1, 1 - f(x) + g(x)\}$.

With respect to the partial ordering $\le$, the substructure $([0,1]^X, \wedge, \vee, \neg, \sim, \mathbf{0}, \mathbf{1})$ is a Brouwer-
Zadeh (BZ) distributive (complete) lattice equipped with two nonclassical negations: the Kleene
negation $\neg$ (possibly violating the noncontradiction principle, $f \wedge \neg f \ne \mathbf{0}$, and the excluded-middle
principle, $f \vee \neg f \ne \mathbf{1}$) and the Brouwer negation $\sim$ (possibly violating the strong double negation
law, $\sim\sim f \ne f$, and the excluded-middle principle, $f \vee \sim f \ne \mathbf{1}$) [5, 6].
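The claims about the derived operations and the two negations can be verified pointwise. The sketch below is a hedged illustration in Python: it works on single membership values rather than on whole functions, and the dyadic test grid `grid` is an assumption made so that the floating-point arithmetic is exact.

```python
def oplus(a, b):
    """(5.2a): truncated sum."""
    return min(1.0, a + b)

def odot(a, b):
    """(5.2b): truncated product-like operation."""
    return max(0.0, a + b - 1.0)

def neg(a):
    """(5.2c): Kleene negation."""
    return 1.0 - a

def brouwer(a):
    """(5.2d): Brouwer negation."""
    return 1.0 if a == 0 else 0.0

def meet(a, b):
    """(5.3a): lattice meet derived from the MV operations."""
    return odot(oplus(a, neg(b)), b)

def join(a, b):
    """(5.3b): lattice join derived from the MV operations."""
    return oplus(odot(a, neg(b)), b)

def implies(a, b):
    """(5.3c): Lukasiewicz implication."""
    return oplus(neg(a), b)

# Multiples of 1/8 are exactly representable, so equality tests are safe here.
grid = [i / 8 for i in range(9)]

meet_is_min = all(meet(a, b) == min(a, b) for a in grid for b in grid)
join_is_max = all(join(a, b) == max(a, b) for a in grid for b in grid)

# The membership value 0.5 witnesses the failures mentioned in the text.
contradiction = meet(0.5, neg(0.5))       # differs from 0
kleene_middle = join(0.5, neg(0.5))       # differs from 1
double_brouwer = brouwer(brouwer(0.5))    # differs from 0.5
brouwer_middle = join(0.5, brouwer(0.5))  # differs from 1
```

A single half-true membership value is therefore enough to break noncontradiction and excluded middle for the Kleene negation, and the strong double negation law and excluded middle for the Brouwer negation.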

To any fuzzy set $f \in [0,1]^X$ we can associate the two subsets of the universe $X$

(5.5) $I(f) := \{x \in X : f(x) = 1\}$, $\quad O(f) := \{x \in X : f(x) \in (0,1]\}$,

called the inner support and the outer support respectively.

In the context of fuzzy set theory, one can construct the fuzzy rough approximation space

(5.6) $\mathfrak{A} := ([0,1]^X, \{0,1\}^X, \Box, \Diamond)$

consisting of: (1) the BZ distributive (complete) lattice $[0,1]^X$ of all fuzzy sets as approximable
elements; (2) the Boolean (complete) lattice $\{0,1\}^X$ of all crisp sets as definable elements; (3)
the inner approximation map $\Box : [0,1]^X \to \{0,1\}^X$ associating with any fuzzy set $f$ its (crisp)
necessity $\Box(f) = \chi_{I(f)}$, i.e., the best approximation of $f$ from the bottom by crisp sets; (4)
the outer approximation map $\Diamond : [0,1]^X \to \{0,1\}^X$ associating with any fuzzy set $f$ its (crisp)
possibility $\Diamond(f) = \chi_{O(f)}$, i.e., the best approximation of $f$ from the top by crisp sets.

Recall that the mappings $\Box$ and $\Diamond$ are modal operators on the Kleene distributive lattice of
all fuzzy sets, i.e., they constitute an algebraic $KS_5$ model of modal logic, where $K$ means that
the basic lattice structure is not a Boolean algebra, but a Kleene algebra [3].

The corresponding rough approximation of a fuzzy set $f$ is then the pair:

(5.7) $r(f) := (\chi_{I(f)}, \chi_{O(f)})$, with $\chi_{I(f)} \le f \le \chi_{O(f)}$,

which can be identified with the pair of subsets of $X$ consisting of the inner and the outer supports:

(5.8) $r(f) = (I(f), O(f))$, with $I(f) \subseteq O(f)$.
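As an illustration of (5.5)-(5.8), the following minimal Python sketch computes the inner and outer supports of a fuzzy set and the associated crisp necessity and possibility. The fuzzy set `f` and its element names are hypothetical; fuzzy sets are stored as dictionaries from elements to membership values.

```python
def inner_support(f):
    """I(f): elements with full membership, as in (5.5)."""
    return {x for x, value in f.items() if value == 1}

def outer_support(f):
    """O(f): elements with nonzero membership, as in (5.5)."""
    return {x for x, value in f.items() if value > 0}

def necessity(f):
    """Crisp lower approximation: the characteristic functional of I(f)."""
    core = inner_support(f)
    return {x: 1 if x in core else 0 for x in f}

def possibility(f):
    """Crisp upper approximation: the characteristic functional of O(f)."""
    support = outer_support(f)
    return {x: 1 if x in support else 0 for x in f}

# Hypothetical fuzzy set on a four-element universe.
f = {"a": 1.0, "b": 0.6, "c": 0.0, "d": 0.2}
rough_pair = (inner_support(f), outer_support(f))   # r(f) as in (5.8)
```

For this `f` the pair is `({"a"}, {"a", "b", "d"})`, and the sandwich condition of (5.7) holds at every point.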
6. Realization of Rough Approximation Spaces by Unsharp Identity Resolutions

In analogy with the approach to unsharp quantum mechanics, we can construct rough approxi-
mation spaces generalizing the notion of identity resolution to the case of fuzzy identity resolution.

A fuzzy identity resolution (also fuzzy partition of the universe, see [1]) is any collection $\pi_f(X) =
\{f_1, f_2, \ldots, f_n\}$ of fuzzy sets satisfying the following conditions:

(fir-1) all $f_i$ are real valued, non zero [$f_i \ne \mathbf{0}$], positive and absorbing [$\mathbf{0} \le f_i \le \mathbf{1}$] (in other
words $f_i \in [0,1]^X$);

(fir-2) $\forall i \ne j$, $f_i \le \neg f_j$ (which in general does not imply $f_i \cdot f_j = \mathbf{0}$) [orthogonality condition];

(fir-3) $\sum_{i=1}^{n} f_i = \mathbf{1}$ [identity resolution].

To any fuzzy identity resolution $\pi_f(X)$ of $X$ we can associate two families of subsets:

(6.1) $I(\pi_f(X)) := \{I(f_1), I(f_2), \ldots, I(f_n)\}$, $\quad O(\pi_f(X)) := \{O(f_1), O(f_2), \ldots, O(f_n)\}$

called the inner (partial) covering and the outer covering of $X$ induced from $\pi_f(X)$ respectively.

The inner granule and the outer granule determined by the fuzzy partition $\pi_f(X)$ on the point
$x \in X$ are defined as:

(6.2) $g_i(x) := \bigcap \{I(f_j) : x \in I(f_j)\}$, $\quad g_o(x) := \bigcap \{O(f_j) : x \in O(f_j)\}$.

The inner granule may of course be empty, but the following chain of inclusions holds:

$g_i(x) \subseteq \{x\} \subseteq g_o(x)$.

We now want to show how an unsharp (fuzzy) realization of any sharp identity resolution can
be obtained in a canonical way. Consider a random variable $f_a : X \to val(X)$ associated to an
attribute $a$ from a KR-system; and let $\{P_a(\lambda) = \chi_{M_a(\lambda)} : \lambda \in val(a)\}$ be the corresponding
spectral identity resolution of $f_a$, satisfying in particular the spectral condition (sir-4).

Then, for any function $u : val(a) \to [0,1]$ (whose range is necessarily finite) we can introduce

(6.3) $F_a^{(u)} := \sum_{\lambda \in val(a)} u(\lambda) \cdot P_a(\lambda)$

which is a fuzzy set $F_a^{(u)} : X \to u(val(a))$ defined for any $x \in X$ by the law:

(6.4) $F_a^{(u)}(x) = \sum_{\lambda \in val(a)} u(\lambda) \cdot \chi_{M_a(\lambda)}(x)$

This fuzzy set is realized on the same partition $\pi_a(X) = \{M_a(\lambda) : \lambda \in val(a)\}$ of the attribute
$a$; the set of possible values is however changed from $\{\lambda_1, \lambda_2, \ldots, \lambda_n\} \subseteq \mathbb{R}$ to the new values
$\{u(\lambda_1), u(\lambda_2), \ldots, u(\lambda_n)\} \subseteq [0,1]$.

We can state the following result. 

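Theorem 6.1. Let $\Omega = \{u_k : val(a) \to [0,1] \mid k \in K \subseteq \mathbb{N}\}$, with $|K| < \infty$, be a finite family of
$[0,1]$-valued maps all defined on the value set $val(a)$. For every $k$ construct the corresponding
fuzzy set

(6.5) $F_a^{(\Omega)}(k) := \sum_{\lambda \in val(a)} u_k(\lambda) \cdot P_a(\lambda) \in [0,1]^X$.

Then, the induced map

(6.6) $F_a^{(\Omega)} : K \to [0,1]^X$, $\quad k \mapsto F_a^{(\Omega)}(k)$

is a fuzzy identity resolution [$\forall k \in K$, $F_a^{(\Omega)}(k) \ne \mathbf{0}$, and $\sum_{k \in K} F_a^{(\Omega)}(k) = \mathbf{1}$] iff the $[0,1]$-valued
"matrix" $(u_k(\lambda) : k \in K, \lambda \in val(a))$ is stochastic, i.e.,

(6.7) $\forall \lambda \in val(a)$, $\quad \sum_{k \in K} u_k(\lambda) = 1$.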
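The statement of Theorem 6.1 can be explored numerically. The sketch below is an illustration only, not a proof: the attribute `f_a`, its values and the two weight families are hypothetical, and all helper names are assumptions introduced here. It builds the fuzzy sets of (6.5) and compares the stochastic condition (6.7) with the identity-resolution conditions.

```python
def fuzzy_set(u, f_a):
    """(6.5): F(x) = sum over lambda of u(lambda) * chi_{M_a(lambda)}(x)."""
    values = set(f_a.values())
    return {
        x: sum(u[lam] * (1 if f_a[x] == lam else 0) for lam in values)
        for x in f_a
    }

def is_fuzzy_identity_resolution(fuzzy_sets, universe, tol=1e-12):
    """Every fuzzy set is nonzero and the family sums pointwise to 1."""
    nonzero = all(any(F[x] > 0 for x in universe) for F in fuzzy_sets)
    sums_to_one = all(
        abs(sum(F[x] for F in fuzzy_sets) - 1.0) <= tol for x in universe
    )
    return nonzero and sums_to_one

def is_stochastic(family, values, tol=1e-12):
    """(6.7): for every attribute value the weights sum to 1."""
    return all(abs(sum(u[lam] for u in family) - 1.0) <= tol for lam in values)

# Hypothetical attribute on a four-element universe, with three real values.
f_a = {1: 10.0, 2: 20.0, 3: 20.0, 4: 30.0}
values = set(f_a.values())

# Hypothetical "low"/"high" weightings; the columns sum to 1.
good = [{10.0: 1.0, 20.0: 0.5, 30.0: 0.0}, {10.0: 0.0, 20.0: 0.5, 30.0: 1.0}]
# Same family with one weight lowered, so (6.7) fails at the value 20.
bad = [{10.0: 1.0, 20.0: 0.5, 30.0: 0.0}, {10.0: 0.0, 20.0: 0.25, 30.0: 1.0}]

good_sets = [fuzzy_set(u, f_a) for u in good]
bad_sets = [fuzzy_set(u, f_a) for u in bad]
```

In this example the two conditions agree in both cases: the stochastic family yields a fuzzy identity resolution, while the perturbed family violates both (6.7) and the sum condition.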
1 Introduction

The rough sets theory proposed by Pawlak [8,9] was originally founded on the 
idea of approximating a given set by means of indiscernibility binary relation, 
which was assumed to be an equivalence relation (reflexive, symmetric and tran- 
sitive). With respect to this basic idea, two main theoretical developments have 
been proposed: some extensions to a fuzzy context (e.g. Dubois and Prade, [1,2], 
Slowinski and Stefanowski, [13,14,15], Yao, [19]) and some extensions of the 
indiscernibility relation by means of more general binary relations (e.g. Niemi- 
nen, [7], Lin, [5], Marcus, [6], Polkowski, Skowron and Zytkow, [10], Skowron 
and Stepaniuk, [11], Slowinski, [12], Slowinski and Vanderpooten, [16,17,18], 
Yao and Wong, [20]). Among the latter extensions, we wish to point out the proposal
of Slowinski and Vanderpooten [16,17,18], who introduced and characterized a
general definition of rough approximations using a similarity relation, i.e. a
reflexive binary relation, relaxing the assumptions of symmetry and transitivity.

In this paper, based on Greco, Matarazzo and Slowinski [4], we put together
these two extensions, considering within a fuzzy context the approach proposed
by Slowinski and Vanderpooten. More specifically, we propose to approximate a
given fuzzy set by means of reflexive fuzzy binary relations.

The paper is structured as follows. In section 2 we introduce the rough ap- 
proximation of a given fuzzy set by a fuzzy similarity relation. In section 3 we 
present an application of the proposed approach to an exemplary problem. Sec- 
tion 4 groups conclusions. 

2 Rough approximation by reflexive fuzzy binary relation 

In the following the negation and the classic connectives of fuzzy logic are used
in a suitable way, in particular those of the t-norm $T$ as conjunction, of the
t-conorm $S$ as disjunction, of the fuzzy negation $N$ and of the fuzzy implication $I$
(for a brief but thorough introduction to fuzzy logic see the first chapter of Fodor
and Roubens [3]). Let $U$ be a (non-empty) finite set of objects called the universe
and $R$ a reflexive fuzzy binary relation defined on $U$, i.e. $R : U \times U \to [0,1]$
such that $R(x,x) = 1$ for each $x \in U$. $R$ represents a certain form of similarity.
On this basis we can extend the concepts of positive and negative objects, well
known within classical rough sets theory (e.g. see Pawlak [8]), to a fuzzy context.

Let $X$ be a fuzzy set in $U$ and let also $\mu_X : U \to [0,1]$ be the membership
function of $X$. Given $x \in U$, we say that:

1. the membership function of $x$ to the set of positive objects with respect to
$X$, denoted by Pos$(x,X)$, is the credibility that "for each $y \in U$, $x$ is not
similar to $y$ or $y$ belongs to $X$", i.e.

Pos$(x,X) = T_{y \in U}(S(N(R(x,y)), \mu_X(y)))$;

2. the membership function of $x$ to the set of negative objects with respect to
$X$, denoted by Neg$(x,X)$, is the credibility that "for each $y \in U$, $x$ is not
similar to $y$ or $y$ does not belong to $X$", i.e.

Neg$(x,X) = T_{y \in U}(S(N(R(x,y)), N(\mu_X(y))))$.

Let us remark that, remembering the definition of $S$-implication (see e.g.
Fodor and Roubens [3]), we can write

Pos$(x,X) = T_{y \in U}(I_{S,N}(R(x,y), \mu_X(y)))$, (1)

Neg$(x,X) = T_{y \in U}(I_{S,N}(R(x,y), N(\mu_X(y))))$. (2)

On the basis of Equation 1, Pos$(x,X)$ can be seen as the credibility that "for
each $y \in U$ the similarity of $x$ with respect to $y$ implies that $y$ belongs to $X$".
Analogously, from Equation 2, Neg$(x,X)$ can be seen as the credibility that "for
each $y \in U$ the similarity of $x$ with respect to $y$ implies that $y$ does not belong
to $X$".

Considering a fuzzy set $X$ in $U$ and a reflexive fuzzy binary relation $R$ defined
on $U$, the lower approximation of $X$, denoted by $\underline{R}(X)$, and the upper approxi-
mation of $X$, denoted by $\overline{R}(X)$, are fuzzy sets of $U$ whose membership functions
are respectively defined as

$\mu(x, \underline{R}(X)) = T_{y \in U}(S(N(R(x,y)), \mu_X(y)))$,

$\mu(x, \overline{R}(X)) = S_{y \in U}(T(R(x,y), \mu_X(y)))$.

Let us observe that $\mu(x, \underline{R}(X))$ = Pos$(x,X)$, i.e. the lower approximation of $X$
is the set of positive objects with respect to $X$ as in the classical rough sets
theory. $\mu(x, \overline{R}(X))$ represents the credibility of the proposition "there is at least
one $y \in U$ such that $x$ is similar to $y$ and $y$ belongs to $X$".

Considering a fuzzy set $X$ in $U$ and a reflexive fuzzy binary relation $R$ defined
on $U$, we obtain the following results (Greco, Matarazzo and Slowinski [4]).

Theorem 1.

$\mu(x, \underline{R}(X)) \le \mu_X(x) \le \mu(x, \overline{R}(X))$, $\forall x \in U$.

Theorem 2. If $N$ is involutory and $(S,T,N)$ is a De Morgan triple (see Fodor
and Roubens [3]), then $\overline{R}(X)$ is the complement of the set of negative objects
with respect to $X$, i.e.

$\mu(x, \overline{R}(X)) = N(\text{Neg}(x,X))$, $\forall x \in U$.

Theorem 3. If $(S,T,N)$ is a De Morgan triple and $\mu_{U-X}(x) = N(\mu_X(x))$,
then

$\mu(x, \underline{R}(X)) = N(\mu(x, \overline{R}(U-X)))$, $\forall x \in U$.

Theorems 1 to 3 can be read as the fuzzy counterparts of the following results, well
known within the classical rough set approach: Theorem 1 says that the set $X$ includes
its lower approximation and is included in its upper approximation; Theorem 2
says that the upper approximation is the complement of the set of negative
objects; Theorem 3 says that the lower approximation of $X$ is the complement of the
upper approximation of the complement of $X$.
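Theorems 1 to 3 can be checked on a toy instance. The sketch below assumes the operators $T = \min$, $S = \max$, $N(a) = 1 - a$ (a De Morgan triple with involutory negation); the three-object universe, the relation `R` and the membership values `mu` are hypothetical and use dyadic values so that equality tests are exact.

```python
def lower(R, mu, U):
    """Lower approximation: T_y S(N(R(x, y)), mu(y)) with T=min, S=max, N=1-a."""
    return {x: min(max(1 - R[x][y], mu[y]) for y in U) for x in U}

def upper(R, mu, U):
    """Upper approximation: S_y T(R(x, y), mu(y)) with T=min, S=max."""
    return {x: max(min(R[x][y], mu[y]) for y in U) for x in U}

def negative(R, mu, U):
    """Neg(x, X): T_y S(N(R(x, y)), N(mu(y)))."""
    return {x: min(max(1 - R[x][y], 1 - mu[y]) for y in U) for x in U}

# Hypothetical reflexive (not symmetric) fuzzy relation and fuzzy set.
U = ["a", "b", "c"]
R = {
    "a": {"a": 1.0, "b": 0.5, "c": 0.0},
    "b": {"a": 0.25, "b": 1.0, "c": 0.75},
    "c": {"a": 0.0, "b": 0.5, "c": 1.0},
}
mu = {"a": 1.0, "b": 0.25, "c": 0.5}
complement = {x: 1 - mu[x] for x in U}

low, up, neg = lower(R, mu, U), upper(R, mu, U), negative(R, mu, U)
up_of_complement = upper(R, complement, U)
```

With these values `low` is `{"a": 0.5, "b": 0.25, "c": 0.5}` and `up` is `{"a": 1.0, "b": 0.5, "c": 0.5}`, so the three theorems can be read off directly.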

3 An illustrative example 

In order to illustrate the use of the rough approximation by means of a fuzzy sim-
ilarity relation, we propose a simple example already considered in Slowinski [12]
and Slowinski and Vanderpooten [16]. This example is based on a decision table
describing 12 firms which obtained approximately equal credit from a bank. The
firms are characterized by three condition attributes: A1: value of fixed capital,
A2: value of sales in the year preceding the application, A3: kind of activity.
Attributes A1 and A2 are quantitative, while attribute A3 is a qualitative one
with three possible values. Decision attribute $d$ makes a dichotomous partition of
the firms: $d = 1$ if the firm paid back its credit and $d = 2$ otherwise. The decision
table is shown in Table 1. In this application we considered the following fuzzy
logical operators: $\forall a, b \in [0,1]$,

$T(a,b) = \min(a,b)$,

$S(a,b) = \max(a,b)$,

$N(a) = 1 - a$.

For each attribute $q$ we consider a valued binary relation $R_q$, i.e. a function
$R_q : U \times U \to [0,1]$ where, $\forall x, y \in U$, $R_q(x,y)$ represents the credibility of
similarity between $x$ and $y$ with respect to the attribute $q$. Practically, $R_q(x,y)$
is computed on the basis of the evaluations of $x$ and $y$ by means of attribute $q$,
denoted by $f(x,q)$ and $f(y,q)$. More precisely, for each attribute $q$ and $\forall x, y, w, z \in U$:

$R_q(x,y) = 0$ means absence of similarity between $x$ and $y$,

$R_q(x,y) = 1$ means that $x$ is absolutely similar to $y$ ($R_q(x,x) = 1$),

$R_q(x,y) \ge R_q(w,z)$ means that the similarity between $x$ and $y$ is at least as
credible as the similarity between $w$ and $z$.
Table 1. Sample of firms

Firm    A1     A2     A3    d
F1       43     78    0     1
F2       54     75    1     2
F3      124     50    1     1
F4      102     65    1     2
F5       98     80    2     2
F6       88    102    2     2
F7      130     57    0     1
F8      128     92    1     2
F9       82     59    1     1
F10     134    103    2     2
F11      58     55    0     1
F12     126     71    1     2


The similarity relation with respect to A1 and A2 was defined, $\forall x, y \in U$, as

$R_q(x,y) = \begin{cases} 1 & \text{if } |f(x,q) - f(y,q)| \le 0.3 f(y,q) \\ \dfrac{0.4 f(y,q) - |f(x,q) - f(y,q)|}{0.1 f(y,q)} & \text{if } 0.3 f(y,q) < |f(x,q) - f(y,q)| \le 0.4 f(y,q) \\ 0 & \text{if } |f(x,q) - f(y,q)| > 0.4 f(y,q), \end{cases}$

where $q = 1, 2$. With respect to A3 the similarity relation was defined, $\forall x, y \in U$,
as

$R_3(x,y) = \begin{cases} 1 & \text{if } f(x,3) = f(y,3) \\ 0 & \text{if } f(x,3) \ne f(y,3). \end{cases}$

For attribute A3 we shall therefore write "$x$ has the same evaluation as $y$ with
respect to A3", rather than "$x$ is similar to $y$ with respect to A3", $\forall x, y \in U$.

To model the comprehensive similarity between $x$ and $y \in U$ with respect
to $P \subseteq \{A1, A2, A3\}$, denoted by $R_P(x,y)$, we consider the credibility of the
proposition "$x$ is similar to $y$ with respect to all the attributes of $P$". Thus we
obtain

$R_P(x,y) = T_{q \in P}(R_q(x,y))$.

We approximate the set of the firms which paid back, i.e.
$X_1 = \{F1, F3, F7, F9, F11\}$, and the set of firms which did not pay back,
i.e. $X_2 = \{F2, F4, F5, F6, F8, F10, F12\}$.

Let us observe that $X_1$ and $X_2$ are crisp sets. Of course, we can also in this
case apply our approach by stating $\mu_X(x) = 1$ if $x \in X$ and $\mu_X(x) = 0$ if $x \notin X$,
$X = X_1, X_2$.

We calculated the lower and upper approximation of Xi and X 2 obtaining 
the results presented in Table 2. 
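The computation behind Table 2 can be sketched in a few lines of Python. This is a hedged reconstruction rather than the authors' program: the data are the rows of Table 1 (with $d$ read off from $X_1$ and $X_2$), the operators are $T = \min$, $S = \max$, $N(a) = 1 - a$, and the linear middle branch of the quantitative similarity is an assumption that is consistent with the values 0.41 and 0.59 reported for F2.

```python
# Table 1 encoded as (A1, A2, A3, d).
FIRMS = {
    "F1": (43, 78, 0, 1), "F2": (54, 75, 1, 2), "F3": (124, 50, 1, 1),
    "F4": (102, 65, 1, 2), "F5": (98, 80, 2, 2), "F6": (88, 102, 2, 2),
    "F7": (130, 57, 0, 1), "F8": (128, 92, 1, 2), "F9": (82, 59, 1, 1),
    "F10": (134, 103, 2, 2), "F11": (58, 55, 0, 1), "F12": (126, 71, 1, 2),
}

def quantitative_similarity(fx, fy):
    """R_q(x, y) for A1 and A2; thresholds are relative to f(y, q)."""
    gap = abs(fx - fy)
    if gap <= 0.3 * fy:
        return 1.0
    if gap <= 0.4 * fy:
        # Assumed linear decrease from 1 to 0 between the two thresholds.
        return (0.4 * fy - gap) / (0.1 * fy)
    return 0.0

def similarity(x, y):
    """R_P(x, y) for P = {A1, A2, A3}, aggregated with T = min."""
    a1x, a2x, a3x, _ = FIRMS[x]
    a1y, a2y, a3y, _ = FIRMS[y]
    same_activity = 1.0 if a3x == a3y else 0.0
    return min(quantitative_similarity(a1x, a1y),
               quantitative_similarity(a2x, a2y),
               same_activity)

def membership(y, target):
    """Crisp membership of y in the set `target`."""
    return 1.0 if y in target else 0.0

def lower(x, target):
    """Membership of x in the lower approximation of `target`."""
    return min(max(1.0 - similarity(x, y), membership(y, target)) for y in FIRMS)

def upper(x, target):
    """Membership of x in the upper approximation of `target`."""
    return max(min(similarity(x, y), membership(y, target)) for y in FIRMS)

X1 = {name for name, row in FIRMS.items() if row[3] == 1}
X2 = set(FIRMS) - X1

f2_row = (round(lower("F2", X1), 2), round(lower("F2", X2), 2),
          round(upper("F2", X1), 2), round(upper("F2", X2), 2))
```

For F2 the only relevant neighbour in $X_1$ is F9, whose similarity on A1 is $(0.4 \cdot 82 - 28) / (0.1 \cdot 82) \approx 0.585$; this gives the row `(0.0, 0.41, 0.59, 1.0)`. Note that the relation is not symmetric, because the thresholds depend on $f(y,q)$.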

The knowledge contained in the considered decision table (Table 1) can be 
expressed in terms of certain or possible fuzzy decision rules, which are spe- 
cific “if. . ., then . . .” propositions with an associated credibility (see Greco, 
Matarazzo, Slowinski [4]). 
Table 2. Lower and upper approximations with respect to {A1, A2, A3}

Firm   $\mu(x,\underline{R}(X_1))$   $\mu(x,\underline{R}(X_2))$   $\mu(x,\overline{R}(X_1))$   $\mu(x,\overline{R}(X_2))$
F1      1       0       1       0
F2      0       0.41    0.59    1
F3      0       0       1       1
F4      0       0       1       1
F5      0       1       0       1
F6      0       1       0       1
F7      1       0       1       0
F8      0       1       0       1
F9      0       0       1       1
F10     0       1       0       1
F11     1       0       1       0
F12     0       1       0       1


The following set of minimal certain rules was obtained from the considered
decision table (each pair $(y, c)$ in parentheses shows the object $y \in U$ which
supports the considered rule with a credibility equal to $c \in [0,1]$):

1. if $f(x,1)$ is similar to 88 and $f(x,2)$ is similar to 102, then $x \in X_2$ with a
credibility equal to 1 ((F2, 0.14), (F4, 0.37), (F5, 1), (F6, 1));

2. if $f(x,1)$ is similar to 128 and $f(x,2)$ is similar to 92, then $x \in X_2$ with
a credibility equal to 0.59 ((F4, 1), (F5, 1), (F6, 0.875), (F8, 1), (F10, 1),
(F12, 1));

3. if $f(x,1)$ is similar to 134 and $f(x,2)$ is similar to 103, then $x \in X_2$ with
a credibility equal to 1 ((F4, 0.31), (F5, 1), (F6, 0.57), (F8, 1), (F10, 1),
(F12, 0.89));

4. if $f(x,3) = 0$, then $x \in X_1$ with a credibility equal to 1 ((F1, 1), (F7, 1),

5. if $f(x,3) = 2$, then $x \in X_2$ with a credibility equal to 1 ((F5, 1), (F6, 1),
(F10, 1));

6. if $f(x,1)$ is similar to 54 and $f(x,3) = 1$, then $x \in X_2$ with a credibility
equal to 0.41 ((F2, 1));

7. if $f(x,2)$ is similar to 92 and $f(x,3) = 1$, then $x \in X_2$ with a credibility
equal to 0.59 ((F2, 1), (F4, 1), (F8, 1), (F12, 1)).

Also some possible decision rules can be obtained from the considered deci-
sion table, e.g.:

8. if $f(x,2)$ is similar to 50, then $x \in X_1$ with a credibility equal to 1 ((F3, 1),
(F7, 1), (F9, 1), (F11, 1)).

9. if $f(x,2)$ is similar to 98, then $x \in X_2$ with a credibility equal to 1 ((F2, 1),
(F4, 1), (F5, 1), (F6, 1), (F8, 1), (F10, 1), (F12, 1)).
Let us remark that since rule 8 is a possible decision rule it must be read "if
$f(x,2)$ is similar to 50, then $x$ could belong to $X_1$". An analogous interpretation
should be given to rule 9.

4 Conclusions 

We introduced rough approximations of fuzzy sets by means of a similarity relation
defined as a reflexive fuzzy binary relation. The proposed framework represents
a theoretical development with respect to the extension of the rough set approach
to a fuzzy context and also with respect to approximations by means of
binary relations more general than classical indiscernibility. Fuzzy decision
rules can be extracted from the approximations obtained by the fuzzy similarity
binary relations. As shown by a simple example, the new rough set approach to
decision table analysis gives quite comprehensible results.
Abstract. In this paper we present a generalization of the notion of approxima-
tion space. We also present different notions of rough relations.
We point out the role of searching for a proper approximation space.



1 Introduction 

Rough set theory was proposed [7] as a new approach for processing of incomplete 
data. 

Investigations on relation approximation are well motivated both from theo- 
retical and practical points of view. Let us mention two examples. The equality 
approximation is fundamental for a generalization of the rough set approach 
based on a similarity relation approximating the equality relation in the value 
sets of attributes. Rough set methods in control processes require function ap- 
proximation. 

One can distinguish several directions in research on relation approximations.
Below we list some examples. In [6,15] properties of rough relations
are presented. The relationships between rough relations and modal logics have been
investigated by many authors (see e.g. [20,10]). We will refer to [10], where the
upper approximation of the input-output relation $R(P)$ of a given program $P$
with respect to the indiscernibility relation $IND$ is treated as the composition
$IND \circ R(P) \circ IND$ and where a special symbol for the lower approximation of $R(P)$ is
introduced. Properties of relation approximations in generalized approximation
spaces are presented in [11,18]. The relationships of rough sets with algebras of
relations are investigated for example in [1].

Relationships between rough relations and the problem of ranking objects are
presented for example in [2]. It is shown that the classical rough set approxi-
mations based on the indiscernibility relation do not take into account the ordinal
properties of the considered criteria. This drawback is removed by considering
rough approximations of the preference relations by graded dominance relations
[2].

One of the problems we are interested in is the following: given a subset $X \subseteq U$
or a relation $R \subseteq U \times U$, define $X$ or $R$ in terms of the available information.
In this paper we discuss an approach based on generalized approximation spaces
introduced and investigated in [11,13]. There are several modifications of the
original approximation space definition [7].

The first one concerns the so-called uncertainty function. Information about
an object, say $x$, is represented for example by its attribute value vector. The set
of all objects whose value vectors are similar to the attribute value vector of $x$ forms the
set $I(x)$. In [7] all objects with the same value vector form the indiscernibility
class. The relation $y \in I(x)$ is in this case an equivalence relation. We consider
a more general case in which it can be any relation.

The second modification of the approximation space definition introduces a gen-
eralization of the rough membership function. We assume that to answer the question
whether an object $x$ belongs to an object set $X$ we have to answer the question
whether $I(x)$ is in some sense included in $X$. Hence we take as a primitive no-
tion a rough inclusion function rather than a rough membership function. Our
approach allows us to unify different cases considered in [7,22].

2 Approximations in Approximation Spaces 

In this section we present a general definition of approximation space [11,13] which
can be used for example for introducing the tolerance based rough set model and
the variable precision rough set model.

2.1 Approximation Spaces

An approximation space is a system $AS = (U, I, \nu)$, where

— $U$ is a non-empty set of objects,

— $I : U \to P(U)$ is an uncertainty function ($P(U)$ denotes the set of all
subsets of $U$),

— $\nu : P(U) \times P(U) \to [0,1]$ is a rough inclusion function.

An uncertainty function defines a neighborhood of every object $x$.

The rough inclusion function defines the value of inclusion between two sub-
sets of $U$.

Definitions of the lower and the upper approximations can be written as
follows:

$L(AS, X) = \{x \in U : \nu(I(x), X) = 1\}$,

$U(AS, X) = \{x \in U : \nu(I(x), X) > 0\}$.

For examples of approximation spaces for the variable precision rough set model
[22] and generalized approximation spaces in the information retrieval problem see
[17].

We recall the notion of positive region in the case of generalized approxima-
tion spaces. Let $AS = (U, I, \nu)$ be an approximation space and let $\{X_1, \ldots, X_r\}$
be a classification of objects (i.e. $X_1, \ldots, X_r \subseteq U$, $\bigcup_{i=1}^{r} X_i = U$ and $X_i \cap X_j = \emptyset$
for $i \ne j$, where $i, j = 1, \ldots, r$).
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The positive region of the classification with respect to the 

approximation space AS is defined by 

r 

POS {AS, {X^,...,Xr})=[jL {AS, Xi) . 

i=l 
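The definitions of this subsection translate almost literally into code. The following minimal sketch uses the standard rough inclusion $card(X \cap Y)/card(X)$; the six-element universe and the tolerance-style uncertainty function `I` (neighbours at distance at most 1) are hypothetical choices made for illustration.

```python
def standard_inclusion(X, Y):
    """Standard rough inclusion: card(X & Y) / card(X), and 1 for empty X."""
    return len(X & Y) / len(X) if X else 1.0

def lower_approx(U, I, nu, X):
    """L(AS, X): objects whose neighborhood is fully included in X."""
    return {x for x in U if nu(I(x), X) == 1}

def upper_approx(U, I, nu, X):
    """U(AS, X): objects whose neighborhood overlaps X."""
    return {x for x in U if nu(I(x), X) > 0}

def positive_region(U, I, nu, classes):
    """POS(AS, {X_1, ..., X_r}): union of the lower approximations."""
    return set().union(*(lower_approx(U, I, nu, X_i) for X_i in classes))

# Hypothetical universe with a reflexive, symmetric, non-transitive neighborhood.
U = {1, 2, 3, 4, 5, 6}

def I(x):
    return {y for y in U if abs(x - y) <= 1}

X1, X2 = {1, 2, 3}, {4, 5, 6}
low_X1 = lower_approx(U, I, standard_inclusion, X1)
up_X1 = upper_approx(U, I, standard_inclusion, X1)
pos = positive_region(U, I, standard_inclusion, [X1, X2])
```

Here `low_X1` is `{1, 2}`, `up_X1` is `{1, 2, 3, 4}` and the positive region is `{1, 2, 5, 6}`: the boundary objects 3 and 4 have neighborhoods that straddle both classes.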



2.2 Properties of Approximations and Rough Definability 

In this subsection we first list properties of approximations in generalized ap-
proximation spaces. Next we present algorithms for checking rough definability,
internal undefinability etc.

Let $AS = (U, I, \nu)$ be an approximation space. We define for two sets $X, Y \subseteq U$
the equality with respect to the rough inclusion $\nu$ in the following way: $X =_\nu Y$
if and only if $\nu(X, Y) = 1 = \nu(Y, X)$.

Assume that the following conditions are satisfied:

— $x \in I(x)$, for every $x \in U$,

— $\nu$ is a standard rough inclusion, i.e.

$\nu(X, Y) = \begin{cases} \dfrac{card(X \cap Y)}{card(X)} & \text{if } X \ne \emptyset \\ 1 & \text{if } X = \emptyset \end{cases}$

for any $X, Y \subseteq U$.

Then one can show the following properties of approximations:

1. $\nu(L(AS, X), X) = 1$ and $\nu(X, U(AS, X)) = 1$.

2. $L(AS, \emptyset) =_\nu U(AS, \emptyset) =_\nu \emptyset$, $L(AS, U) =_\nu U(AS, U) =_\nu U$.

3. $U(AS, X \cup Y) =_\nu U(AS, X) \cup U(AS, Y)$.

4. $L(AS, X \cap Y) =_\nu L(AS, X) \cap L(AS, Y)$.

5. $\nu(X, Y) = 1$ implies $\nu(L(AS, X), L(AS, Y)) = \nu(U(AS, X), U(AS, Y)) = 1$.

6. $\nu(L(AS, X) \cup L(AS, Y), L(AS, X \cup Y)) = 1$.

7. $\nu(U(AS, X \cap Y), U(AS, X) \cap U(AS, Y)) = 1$.

8. $L(AS, U - X) =_\nu U - U(AS, X)$.

9. $U(AS, U - X) =_\nu U - L(AS, X)$.

10. $\nu(L(AS, L(AS, X)), L(AS, X)) = \nu(L(AS, X), U(AS, L(AS, X))) = 1$.

11. $\nu(L(AS, U(AS, X)), U(AS, X)) = \nu(U(AS, X), U(AS, U(AS, X))) = 1$.



By analogy with standard rough set theory we define the following four types
of sets:

1. $X$ is roughly AS-definable iff $L(AS, X) \ne_\nu \emptyset$ and $U(AS, X) \ne_\nu U$.

2. $X$ is internally AS-undefinable iff $L(AS, X) =_\nu \emptyset$ and $U(AS, X) \ne_\nu U$.

3. $X$ is externally AS-undefinable iff $L(AS, X) \ne_\nu \emptyset$ and $U(AS, X) =_\nu U$.

4. $X$ is totally AS-undefinable iff $L(AS, X) =_\nu \emptyset$ and $U(AS, X) =_\nu U$.
The algorithms for checking the corresponding properties of sets have $O(n^2)$
time complexity, where $n = card(U)$. For example, we sketch the algorithm for rough
AS-definability.

function rough_definability(I, ν, X): boolean;
var temp1, temp2: boolean;
begin
  temp1 := FALSE; temp2 := FALSE;
  for x ∈ U do
  begin
    if ν(I(x), X) = 1 then temp1 := TRUE;
    if ν(I(x), X) = 0 then temp2 := TRUE;
    if temp1 AND temp2 then return(TRUE)
  end;
  return(FALSE)
end;

Let us also observe that, using the properties of approximations

$L(AS, U - X) =_\nu U - U(AS, X)$ and $U(AS, U - X) =_\nu U - L(AS, X)$,

one can obtain that $X$ is internally AS-undefinable if and only if $U - X$ is
externally AS-undefinable. Thus we can use the same algorithm in both cases.
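The four types of sets can be distinguished with the same two tests used in the sketched algorithm. The Python version below is a minimal illustration under the standard rough inclusion; the universe, the uncertainty function `I` and the sample sets are hypothetical, and the function name `classify` is an assumption introduced here.

```python
def standard_inclusion(X, Y):
    """Standard rough inclusion: card(X & Y) / card(X), and 1 for empty X."""
    return len(X & Y) / len(X) if X else 1.0

def classify(U, I, nu, X):
    """Return which of the four definability types the set X belongs to."""
    # L(AS, X) is non-empty iff some neighborhood is fully included in X.
    has_lower = any(nu(I(x), X) == 1 for x in U)
    # U(AS, X) equals U iff every neighborhood overlaps X.
    upper_is_all = all(nu(I(x), X) > 0 for x in U)
    if has_lower and not upper_is_all:
        return "roughly definable"
    if not has_lower and not upper_is_all:
        return "internally undefinable"
    if has_lower and upper_is_all:
        return "externally undefinable"
    return "totally undefinable"

# Hypothetical universe; neighbors are the objects at distance at most 1.
U = {1, 2, 3, 4, 5, 6}

def I(x):
    return {y for y in U if abs(x - y) <= 1}

kinds = {
    "X={1,2,3}": classify(U, I, standard_inclusion, {1, 2, 3}),
    "X={3}": classify(U, I, standard_inclusion, {3}),
    "X=U-{3}": classify(U, I, standard_inclusion, U - {3}),
    "X={2,5}": classify(U, I, standard_inclusion, {2, 5}),
}
```

The pair `{3}` and `U - {3}` illustrates the closing observation: the first set is internally undefinable and its complement is externally undefinable. Each call evaluates the inclusion once per object for each test, in line with the quadratic bound when one inclusion costs time linear in $card(U)$.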



2.3 Approximation Spaces and Rough Relations 

In this subsection we discuss approximations of relations with respect to different
rough inclusions.

Let $AS = (U, I, \nu)$ be an approximation space, where $U = U_1 \times U_2$ and $I$
defines a partition of $U$. The rough inclusion function is defined for any relations
$S, R \subseteq U_1 \times U_2$ by

$\nu(S, R) = \begin{cases} \dfrac{card(S \cap R)}{card(S)} & \text{if } S \ne \emptyset \\ 1 & \text{if } S = \emptyset. \end{cases}$

For any relation $R \subseteq U_1 \times U_2$ we define two relations $L(AS, R)$ and $U(AS, R)$,
called the lower and the upper approximation of $R$, respectively, and defined as
follows:

$L(AS, R) = \{(x_1, x_2) \in U : \nu(I((x_1, x_2)), R) = 1\}$,

$U(AS, R) = \{(x_1, x_2) \in U : \nu(I((x_1, x_2)), R) > 0\}$.

By $\pi_i(R)$ we denote the projection of the relation $R$ onto the $i$-th axis, i.e.
for example for $i = 1$

$\pi_1(R) = \{x_1 \in U_1 : \exists_{x_2 \in U_2} (x_1, x_2) \in R\}$.

The definition of the rough inclusion function for relations can be based on
the cardinality of the projections. Let us assume

$\nu_{\pi_i}(S, R) = \begin{cases} \dfrac{card(\pi_i(S \cap R))}{card(\pi_i(S))} & \text{if } S \ne \emptyset \\ 1 & \text{if } S = \emptyset \end{cases}$

where $i = 1, 2$.
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The standard lower approximation of a relation R C U 1 XU 2 has the following 
property: any objects x\ ^ Ui^X2 ^ U2 are connected by the lower approximation 
of R if and only if any objects (^1,^2) from I {{xi^X 2 )) are in the relation R. 

One can propose some less restrictive definitions of the lower approximation 
using the rough inclusions and Let AS^^. = {Ui x and 

L{AS^,,R) = {{xi,X2) ^UixU2'. J^TTi {I {{xi,X2)) ,R) = 1} 

for i = 1 , 2 . Assuming i = 1 we have that the pair (xi,X2) is in the lower 
approximation L {ASt^. , R) if and only if for every yi there is y 2 such that the 
pair (^1,^2) is from I {{xi^X 2 )) and in the relation R. Similar interpretation we 
obtain for i = 2. 

One can also propose some more restrictive definition of the lower approx- 
imation using the approximation space ASres = {U\X U 2 ,I,i^res) with rough 
inclusion i/res defined by: 

Vres <yS, R) = ly (S, R)*v (tti (i?i) , 7Ti (S')) * V (tV 2 (R 2 ) , 7T2 (S)) 

where Ri = RCi {U\ x 7T2 (S)) and R 2 = RC\ (tti (S) x U 2 ) ■ 

Proposition 1. For the lower and the upper approximations the following eon- 
ditions are satisfied: 

L L {ASres. R) CL{AS,R) C R. 

2. L {AS, R)QL {AS ^^ , R) and L {AS, R) C L {AS^^ , R) . 

3. U {AS, R) = U {AS^^ .R) = U {AS^^ .R) = U {ASres. R) • 

Proposition 2. The eomputational eomplexity of algorithms for eomputing ap- 
proximations of relations is equal to O (^{card {U))‘^^ . 

3 Searching for Approximation Spaces 

In this section we consider problem of searching for adequate uncertainty func- 
tion in approximation space. The searching for proper uncertainty function is 
the crucial and most difficult task related to decision algorithm synthesis based 
on uncertainty functions. 

One approach to searching for an uncertainty function is based on the as- 
sumption that there are given some metrics (distances) on attribute values. Dis- 
tance and similarity are closely related. Relations obtained on attribute values 
by using metrics are reflexive and symmetrical i.e. are tolerance relations. For 
review of different metrics defined on attribute values see [ 21 ]. Here we only 
present two examples of such metrics. 

The Value Difference Metric (VDM) provides an appropriate distance func- 
tion for nominal attributes. A simplified version of VDM (without the weighting 
schemes) defines the distance between two values v and v' of an attribute a as: 

$$vdm_a(v, v') = \sum_{i=1}^{r(d)} \left(\nu(X_v, X_i) - \nu(X_{v'}, X_i)\right)^2,$$

where $r(d)$ is the number of decision classes, $\nu$ is the standard rough inclusion,
$X_i = \{x \in U : d(x) = i\}$ and $X_v = \{x \in U : a(x) = v\}$.

Using the distance measure VDM, two values are considered to be closer 
if they have more similar classifications. For example, if an attribute color has 
three values red, green and blue, and the application is to identify whether or 
not an object is an apple, then red and green would be considered closer than 
red and blue because the former two both have correlations with decision apple. 
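
The apple example can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes the standard rough inclusion is $\nu(X, Y) = card(X \cap Y) / card(X)$ for non-empty $X$ (and 1 otherwise), and the toy decision table, the column names and the function names are hypothetical.

```python
from collections import defaultdict

def rough_inclusion(X, Y):
    """Assumed standard rough inclusion: card(X & Y) / card(X), 1 for empty X."""
    return len(X & Y) / len(X) if X else 1.0

def vdm(table, attr, decision, v, w):
    """Simplified VDM (no weighting) between values v and w of attribute `attr`.

    `table` is a list of dicts, one per object; `decision` names the decision column.
    """
    by_value = defaultdict(set)   # X_v: objects with a(x) = v
    by_class = defaultdict(set)   # X_i: objects with d(x) = i
    for idx, row in enumerate(table):
        by_value[row[attr]].add(idx)
        by_class[row[decision]].add(idx)
    X_v, X_w = by_value[v], by_value[w]
    return sum(
        (rough_inclusion(X_v, X_i) - rough_inclusion(X_w, X_i)) ** 2
        for X_i in by_class.values()
    )

# Hypothetical toy table: red and green both co-occur with apples, blue never does.
TABLE = [
    {"color": "red", "apple": "yes"},
    {"color": "red", "apple": "yes"},
    {"color": "red", "apple": "no"},
    {"color": "green", "apple": "yes"},
    {"color": "green", "apple": "no"},
    {"color": "blue", "apple": "no"},
    {"color": "blue", "apple": "no"},
]

red_green = vdm(TABLE, "color", "apple", "red", "green")
red_blue = vdm(TABLE, "color", "apple", "red", "blue")
```

On this table the red/green distance is 1/18 and the red/blue distance is 8/9, so red and green are the closer pair, as the text describes.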

If this distance function is used directly on continuous attributes, all values 
can potentially be unique. Some approaches to the problem of using VDM on 
continuous attributes are presented in [21]. 

One can also use some other distance function for continuous attributes, for 
example 



$$diff_a(v, v') = \frac{|v - v'|}{\max_a - \min_a},$$

where $\max_a$ and $\min_a$ are the maximum and minimum values, respectively, for
attribute $a \in A$.

Let $\delta_a : V_a \times V_a \to [0, \infty)$ be a given distance function on attribute values,
where $V_a$ is the set of all values of attribute $a \in A$.

One can define the following uncertainty function:

$$y \in I_a^{\varepsilon_a}(x) \text{ if and only if } \delta_a(a(x), a(y)) \le \varepsilon_a,$$

where $\varepsilon_a \ge 0$ is a given real number.
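
A threshold-based uncertainty function of this kind can be sketched as follows. This is only an illustration: it assumes the non-strict comparison (distance not greater than the threshold), uses the normalized difference above as the metric, and the function names and the sample values are hypothetical.

```python
def diff(v, w, lo, hi):
    """Normalized distance |v - w| / (max_a - min_a) for a continuous attribute."""
    return abs(v - w) / (hi - lo)

def uncertainty(values, x, eps):
    """Indices y whose attribute value lies within distance eps of object x.

    `values[k]` is the value a(k) of the attribute on object k; the attribute
    is assumed to take at least two distinct values, so max_a > min_a.
    """
    lo, hi = min(values), max(values)
    return {y for y, v in enumerate(values) if diff(values[x], v, lo, hi) <= eps}

# Hypothetical values of one continuous attribute on four objects.
TEMPS = [36.6, 37.0, 38.5, 40.1]
```

With these values, `uncertainty(TEMPS, 0, 0.2)` contains objects 0 and 1 only, and a threshold of 0 leaves just the object itself, so the function is reflexive for every non-negative threshold.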

Further examples of uncertainty functions can be derived from the
literature. In [14] strict and weak indiscernibility relations were considered, which
can define some kinds of uncertainty functions. In some cases it is natural to
consider relations defined by $\varepsilon$-indiscernibility [3]. For more details on the corresponding
uncertainty functions see [17].

Different methods of searching for parameters of proper uncertainty functions
are discussed, for example, in [3,8,13,16,5]. In [16,4] a genetic algorithm was applied
to search for adequate uncertainty functions of the type $I_a^{\varepsilon_a}$.

Now we discuss the following problem: 

How large is a set of possible uncertainty functions for a given decision table? 
In other words how large is a search space for searching for optimal uncertainty 
function? 

Let $DT = (U, A \cup \{d\})$ be a decision table, where $U$ is a non-empty set such
that $card(U) = n$ and $A = \{a_1, \ldots, a_m\}$ is a set of condition attributes, where
$n, m > 0$ are given natural numbers.

In the further analysis we assume that for every attribute $a \in A$ there is
some metric $\delta_a$ on the set of values and we consider uncertainty functions
$I_a : U \to P(U) - \{\emptyset\}$ such that the following conditions are satisfied:

- $x \in I_a(x)$,

- for every $x, y, z \in U$, if $\delta_a(a(x), a(y)) \le \delta_a(a(x), a(z))$ and $z \in I_a(x)$, then
  $y \in I_a(x)$ (i.e. if the distance between $x$ and $y$ is not greater than the distance
  between $x$ and $z$, and $z$ is related to $x$, then $y$ is related to $x$).

For example, the above conditions are satisfied by the uncertainty function $I_a^{\varepsilon_a}$.
The global uncertainty function is defined as the intersection, i.e.
$I_A(x) = \bigcap_{a \in A} I_a(x)$.
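
The global uncertainty function can be sketched as an intersection of per-attribute neighbourhoods. This is a minimal illustration under stated assumptions: every attribute is numeric, the absolute difference stands in for the metric on its values, and the names `global_uncertainty`, `columns` and `eps` are hypothetical.

```python
def global_uncertainty(columns, eps, x):
    """Intersection over all attributes of the threshold neighbourhoods of x.

    `columns` maps an attribute name to the list of its values on the objects;
    `eps` maps an attribute name to its non-negative threshold.
    """
    n_objects = len(next(iter(columns.values())))
    result = set(range(n_objects))
    for name, values in columns.items():
        # Per-attribute neighbourhood: objects within eps[name] of x.
        result &= {y for y, v in enumerate(values) if abs(values[x] - v) <= eps[name]}
    return result

# Hypothetical decision table with two condition attributes and three objects.
COLUMNS = {"a": [1, 2, 5], "b": [10, 30, 11]}
```

Object 1 is close to object 0 on attribute `a` but far on `b`, and object 2 is the reverse, so with thresholds 1 and 2 the intersection for object 0 is the singleton containing object 0.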

We introduce some notions which are used in the next theorem. For every
object $x \in U$ we define an equivalence relation $E(\delta_a, x)$ as follows:

$(y, z) \in E(\delta_a, x)$ if and only if $\delta_a(a(x), a(y)) = \delta_a(a(x), a(z))$.

We number the equivalence classes of the relation $E(\delta_a, x)$ starting from the
one closest to the object $x$. Let $P_{(\delta_a, x)} : U \to \{1, \ldots, card(U / E(\delta_a, x))\}$ be a
numbering such that for every $y, z \in U$, $P_{(\delta_a, x)}(y) \le P_{(\delta_a, x)}(z)$ if and only if
$\delta_a(a(x), a(y)) \le \delta_a(a(x), a(z))$.

Let $Y(\delta_a, x) \in U / E(\delta_a, x)$ be the set of objects $y \in U$ with a different decision
than $x$ and as close as possible to $x$, i.e. $y \in Y(\delta_a, x)$ if and only if $d(x) \ne d(y)$
and $(\delta_a(a(x), a(z)) < \delta_a(a(x), a(y))) \Rightarrow d(x) = d(z)$.

Theorem 1. 1. For every attribute $a \in A$ the number of possible different
uncertainty functions is equal to $\prod_{x \in U} card(U / E(\delta_a, x))$.

2. Let $y$ be a given object. The number of the functions $I_A$, where $a \in A$, such
that $y \notin I_A(x)$ is equal to $\ldots * \left(\sum_{a \in A} (P_{(\delta_a, x)}(y) - 1)\right)$.

3. The number of the functions $I_a$, where $a \in A$, such that for every $y \in Y(\delta_a, x)$,
$y \notin \bigcap_{a \in A} I_a(x)$ is not greater than $n^{\ldots}$.
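
The size of the search space for a single attribute can be computed directly. The sketch below assumes that part 1 of the theorem counts the product, over all objects $x$, of the number of equivalence classes of $E(\delta_a, x)$, i.e. the number of distinct distances from $x$; the function name and the sample values are hypothetical.

```python
from math import prod

def count_uncertainty_functions(values, dist):
    """Product over objects x of the number of distinct distances from x.

    Each distinct distance from x corresponds to one equivalence class of
    E(delta_a, x), hence to one possible cut-off for the neighbourhood of x.
    """
    return prod(len({dist(vx, vy) for vy in values}) for vx in values)

# Hypothetical attribute values on four objects, with |v - w| as the metric.
VALUES = [1, 2, 2, 4]
n_functions = count_uncertainty_functions(VALUES, lambda v, w: abs(v - w))
```

Here every object sees three distinct distances, so the count is 3 to the power 4, which is 81.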

Conclusions. We have presented different notions of rough relations. We
hope that they can serve as a good starting point for investigating
important problems related to relation approximation, for example the optimization
problem in controller design. We have pointed out the role of searching for proper
uncertainty functions. Adequate approximation spaces are important for extracting
laws from decision tables.
Abstract. This paper reviews and discusses generalizations
of Pawlak rough set approximation operators in mathematical 
systems, such as topological spaces, closure systems, lattices, 
and posets. The structures of generalized approximation spaces 
and the properties of approximation operators are analyzed. 



1 Introduction 

In the development of the theory of rough sets, approximation operators are 
typically defined by using equivalence relations [10,11]. Researchers have pro- 
posed many generalized notions of approximation operators [2,12,15,17,19,20]. 
Based on the results of these studies, we review and discuss generalizations of 
Pawlak rough set approximation operators, and show their connections with 
other mathematical systems. 

We interpret the rough set theory as an extension of set theory with two 
additional unary set-theoretic operators referred to as approximation opera- 
tors [16,18]. Such an interpretation is consistent with interpreting modal logic 
as an extension of classical two-valued logic with two added unary operators [5]. 
With respect to an equivalence relation on a finite and nonempty universe, one 
can construct a subsystem of the power set of the universe, namely, the $\sigma$-algebra
or the topology generated by the equivalence classes. Every subset of the universe 
is approximated by two sets of the subsystem. By generalizing this formulation, 
one may obtain generalized approximation operators. The resulting systems are 
related to topological spaces and closure systems. These systems are well-known 
in logic and algebraic literature [3,7,9,13,14]. We will first review and apply the 
relevant results for the present study of approximation operators. Further gen- 
eralizations of approximation operators are studied using posets based on the 
recent work of Cattaneo [2]. 

2 Set-theoretic Approximation Operators 

In this section, we apply results from the algebraic study of logic [3,7,13,9,14] to 
Pawlak type approximation operators. A common formulation is adopted from 
a recent paper by Cattaneo [2]. Two subsystems of the power set of a universe 
are considered. They are the family of inner definable subsets of the universe 
and the family of outer definable subsets. An arbitrary subset of the universe is
approximated by an inner definable subset and an outer definable subset. 

Let $E \subseteq U \times U$ be an equivalence relation on a finite and nonempty
universe $U$. That is, the relation $E$ is reflexive, symmetric, and transitive. The pair
$apr = (U, E)$ is called a Pawlak approximation space. The equivalence relation
$E$ partitions the universe $U$ into disjoint subsets called equivalence classes.
Elements in the same equivalence class are said to be indistinguishable. Equivalence
classes of $E$ are called elementary sets. A union of elementary sets is called a
definable (composed) set [10,11]. The empty set is considered to be a definable
set [16]. The family of all definable sets is denoted by $Def(U)$. It is a $\sigma$-algebra
of subsets of $U$. A Pawlak approximation space uniquely defines a topological
space $(U, Def(U))$, in which $Def(U) \subseteq 2^U$ is the family of all open and closed
sets [10].

For a subset $X \subseteq U$, one can approximate $X$ by a pair of subsets of $U$.
The lower approximation $i(X)$ is the greatest definable set contained in $X$, and
the upper approximation $c(X)$ is the least definable set containing $X$. They
correspond to the interior and closure of $X$ in the topological space $(U, Def(U))$.
Thus we have the definition:

(P) $i(X) = \bigcup \{Y \mid Y \in Def(U), Y \subseteq X\}$,

    $c(X) = \bigcap \{Y \mid Y \in Def(U), X \subseteq Y\}$.

One may interpret $i, c : 2^U \to 2^U$ as unary set-theoretic operators [8,15]. They
are dual operators in the sense that $i(X) = \neg c(\neg X)$ and $c(X) = \neg i(\neg X)$. The
system $(2^U, \neg, i, c, \cap, \cup)$ is called a Pawlak rough set algebra. It is an extension
of the classical set algebra $(2^U, \neg, \cap, \cup)$.
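
The two operators and their duality can be illustrated on a small universe. This is a minimal sketch with hypothetical names; it computes the upper approximation as the union of the equivalence classes that meet the set, which for a partition coincides with the least definable set containing it.

```python
def lower(partition, X):
    """i(X): union of the equivalence classes contained in X."""
    return set().union(*(B for B in partition if B <= X))

def upper(partition, X):
    """c(X): union of the equivalence classes that intersect X."""
    return set().union(*(B for B in partition if B & X))

# Hypothetical universe with three elementary sets.
U = {1, 2, 3, 4, 5, 6}
PARTITION = [{1, 2}, {3, 4}, {5, 6}]
X = {1, 2, 3}

low, up = lower(PARTITION, X), upper(PARTITION, X)
# Duality: i(X) equals the complement of c applied to the complement of X.
dual_low = U - upper(PARTITION, U - X)
```

For this choice the lower approximation is the class containing 1 and 2, the upper approximation adds the class containing 3 and 4, and the dual construction reproduces the lower approximation.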

Pawlak approximation operators have the following properties: 

(i1) $i(X \cap Y) = i(X) \cap i(Y)$,

(i2) $i(X) \subseteq X$,

(i3) $i(i(X)) = i(X)$,

(i4) $i(U) = U$,

(i5) $c(X) = i(c(X))$,

and

(c1) $c(X \cup Y) = c(X) \cup c(Y)$,

(c2) $X \subseteq c(X)$,

(c3) $c(c(X)) = c(X)$,

(c4) $c(\emptyset) = \emptyset$,

(c5) $i(X) = c(i(X))$.

The above sets of properties are not independent. In fact, (i2) and (i5) imply (i3), 
and (c2) and (c5) imply (c3). The first four properties are the Kuratowski axioms 
for topological interior and closure operators. Axioms (i5) and (c5) show that an 
inner definable subset is also outer definable, and vice versa. Conversely, given a 
pair of approximation operators $i, c : 2^U \to 2^U$ satisfying axioms (i1)-(i5) and
axioms (c1)-(c5), respectively, their fixed elements:

$$Def(U) = \{X \mid i(X) = X\} = \{X \mid c(X) = X\}, \qquad (1)$$

are the open, and the closed, sets of a 0-dimensional topological space.

In the formulation of Pawlak approximation operators, a special type of
topological space is used, in which the set of inner definable subsets (open sets) is
the same as the set of outer definable subsets (closed sets). For an arbitrary
topological space, the family of open sets is in general different from the family of closed
sets. This immediately leads to a generalization of Pawlak approximation
operators. Let $(U, O(U))$ be a topological space, where $O(U) \subseteq 2^U$ is a family
of subsets of $U$ called open sets. The family of open sets contains $\emptyset$ and $U$,
and is closed under union and finite intersection. The family of all closed sets
$C(U) = \{\neg X \mid X \in O(U)\}$ contains $\emptyset$ and $U$, and is closed under intersection
and finite union. Following Cattaneo [2], a pair of approximation operators is
defined by:

(T) $i(X) = \bigcup \{Y \mid Y \in O(U), Y \subseteq X\}$,

    $c(X) = \bigcap \{Y \mid Y \in C(U), X \subseteq Y\}$.

They satisfy axioms (i1)-(i4), and axioms (c1)-(c4), respectively. Conversely,
given approximation operators $i, c : 2^U \to 2^U$ satisfying axioms (i1)-(i4) and
axioms (c1)-(c4), the sets of their fixed points:

$$O(U) = \{X \mid i(X) = X\},$$

$$C(U) = \{X \mid c(X) = X\}, \qquad (2)$$

are the families of open and, respectively, closed sets of a topological space.
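
A small topology in which open and closed sets differ shows how the two operators separate. The sketch below is illustrative only; the chain topology on three points and the function names are hypothetical choices.

```python
def interior(opens, X):
    """i(X): union of the open sets contained in X."""
    return frozenset().union(*(O for O in opens if O <= X))

def closure(opens, U, X):
    """c(X): intersection of the closed sets (complements of opens) containing X."""
    result = frozenset(U)
    for O in opens:
        closed = U - O
        if X <= closed:
            result &= closed
    return result

# Hypothetical chain topology on a three-point universe.
U = frozenset({1, 2, 3})
OPENS = [frozenset(), frozenset({1}), frozenset({1, 2}), U]

X = frozenset({2})
int_X = interior(OPENS, X)      # no non-empty open set fits inside {2}
cl_X = closure(OPENS, U, X)     # smallest closed set containing {2}
```

The set containing only 1 is open but not closed here: its interior is itself while its closure is the whole universe, so the fixed points of the two operators form different families.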

The notion of closed sets in a topological space may be further generalized.
A family $C(U)$ of subsets of $U$ is called a closure system if it contains $U$ and is
closed under intersection [3]. By collecting the complements of members of $C(U)$,
we obtain another system $O(U) = \{\neg X \mid X \in C(U)\}$. According to the properties
of $C(U)$, the system $O(U)$ contains the empty set $\emptyset$ and is closed under union.
In this case, we define two approximation operators in a closure system:

(C) $i(X) = \bigcup \{Y \mid Y \in O(U), Y \subseteq X\}$,

    $c(X) = \bigcap \{Y \mid Y \in C(U), X \subseteq Y\}$.

They satisfy axioms (i2) and (i3), and axioms (c2) and (c3), as well as the
following weaker versions of (i1) and (c1):

(i0) If $X \subseteq Y$, then $i(X) \subseteq i(Y)$,

(c0) If $X \subseteq Y$, then $c(X) \subseteq c(Y)$.

Conversely, for a closure operator $c : 2^U \to 2^U$ satisfying axioms (c0), (c2), and
(c3), the set of its fixed points:

$$C(U) = \{X \mid c(X) = X\}, \qquad (3)$$

is a closure system. Similar results can be stated between the system $O(U)$:

$$O(U) = \{X \mid i(X) = X\}, \qquad (4)$$

and the dual operator $i(X) = \neg c(\neg X)$.
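
The step from topological closure to a general closure system loses additivity (c1) while keeping monotonicity (c0). A minimal sketch with a hypothetical closure system on three points makes this concrete; the function name is an assumption as well.

```python
def closure(system, U, X):
    """c(X): intersection of the members of the closure system that contain X."""
    result = frozenset(U)
    for member in system:
        if X <= member:
            result &= member
    return result

# Hypothetical closure system: contains U and is closed under intersection,
# but is not closed under union ({1} and {2} are members, {1, 2} is not).
U = frozenset({1, 2, 3})
SYSTEM = [frozenset(), frozenset({1}), frozenset({2}), U]

a, b = frozenset({1}), frozenset({2})
c_union = closure(SYSTEM, U, a | b)                      # closure of the union
union_c = closure(SYSTEM, U, a) | closure(SYSTEM, U, b)  # union of the closures
```

Here the closure of the union is the whole universe while the union of the closures is only the two-point set, so (c1) fails; (c2) and (c3) still hold for every subset.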

In defining set-theoretic approximation operators, three definitions, (P), (T),
and (C), are used, in increasing order of generality. For this formulation, inner definable
sets must be closed under union and outer definable sets must be closed under
intersection. A closure system is therefore the most general structure, and
one cannot generalize set-theoretic approximation operators further under the
same formulation.

3 Approximation Operators in Lattices 

The power set of the universe is a special lattice. The results of the last section 
can be generalized as follows. 

Suppose $(B, \neg, \wedge, \vee, 0, 1)$ is a finite Boolean algebra and $(B_0, \neg, \wedge, \vee, 0, 1)$ is a
sub-Boolean algebra. By using (P), one may approximate an element of $B$ using
elements of $B_0$:

(LP) $i(x) = \bigvee \{y \mid y \in B_0, y \le x\}$,

     $c(x) = \bigwedge \{y \mid y \in B_0, x \le y\}$.

Any finite Boolean algebra is a complete Boolean algebra, and hence the above 
definition is well defined. Operators i and c satisfy the axioms: 

(i1) $i(x \wedge y) = i(x) \wedge i(y)$,

(i2) $i(x) \le x$,

(i3) $i(i(x)) = i(x)$,

(i4) $i(1) = 1$,

(i5) $c(x) = i(c(x))$,

and

(c1) $c(x \vee y) = c(x) \vee c(y)$,

(c2) $x \le c(x)$,

(c3) $c(c(x)) = c(x)$,

(c4) $c(0) = 0$,

(c5) $i(x) = c(i(x))$.

Conversely, one may define a pair of approximation operators directly, and use 
their fixed points as inner and outer definable elements. Gehrke and Walker [4] 
considered a more generalized definition in which the Boolean algebra B is re- 
placed by a completely distributive lattice. Like the Pawlak rough set model, 
one subsystem is used. 
Consider a subsystem $O(B)$ of $B$ satisfying the following axioms:

(O1) $0 \in O(B)$, $1 \in O(B)$;

(O2) For any subsystem $Y \subseteq O(B)$, if there exists a least upper bound
     $LUB(Y) = \bigvee Y$, then it belongs to $O(B)$;

(O3) $O(B)$ is closed under finite meet, i.e.,
     for any $x, y \in O(B)$, we have $x \wedge y \in O(B)$.

For a finite Boolean algebra, axiom (O2) in fact states that the system $O(B)$
is closed under join. Elements of $O(B)$ are referred to as inner definable
elements. The complement of an inner definable element is called an outer
definable element. The set of outer definable elements $C(B) = \{\neg x \mid x \in O(B)\}$ is
characterized by the axioms:

(C1) $0 \in C(B)$, $1 \in C(B)$;

(C2) For any subsystem $Y \subseteq C(B)$, if there exists a greatest lower bound
     $GLB(Y) = \bigwedge Y$, then it belongs to $C(B)$;

(C3) $C(B)$ is closed under finite join, i.e.,
     for any $x, y \in C(B)$, we have $x \vee y \in C(B)$.

From the sets of inner and outer definable elements, we define the following
approximation operators:

(LT) $i(x) = \bigvee \{y \mid y \in O(B), y \le x\}$,

     $c(x) = \bigwedge \{y \mid y \in C(B), x \le y\}$.

They satisfy axioms (i1)-(i4), and axioms (c1)-(c4), respectively, and are the
topological interior and closure operators. The sets of fixed points of $i$ and $c$ are
the inner and outer definable elements, respectively. The system $(B, \neg, i, c, \wedge, \vee, 0, 1)$
is a topological Boolean algebra [14], which is an extension of Boolean algebra
with added operators.

Let $(L, \le, 0, 1)$ be a bounded lattice. Suppose $O(L)$ is a subset of $L$ such
that it contains 0 and is closed under join, and $C(L)$ a subset of $L$ such that it
contains 1 and is closed under meet. They are complete lattices, although the
meet of $O(L)$ and the join of $C(L)$ may be different from those of $L$. Based on
these two systems, we can define two approximation operators as follows:

(LC) $i(x) = \bigvee \{y \mid y \in O(L), y \le x\}$,

     $c(x) = \bigwedge \{y \mid y \in C(L), x \le y\}$.

The approximation operators satisfy axioms (i2) and (i3), axioms (c2) and (c3),
as well as the following weaker versions of (i1) and (c1), respectively:

(i0) If $x \le y$, then $i(x) \le i(y)$,

(c0) If $x \le y$, then $c(x) \le c(y)$.
The sets $O(L)$ and $C(L)$ are the fixed points of $i$ and $c$, respectively. The
operator $c$ is a closure operator [1]. $C(L)$ corresponds to the closure system in the
set-theoretic framework. But since a lattice may not be complemented, we must
explicitly consider both $O(L)$ and $C(L)$. That is, the system $(L, O(L), C(L))$, or
equivalently the system $(L, i, c)$, is used for the generalization of Pawlak
approximation operators.
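
Definition (LC) can be tried out on a small lattice that is not complemented. The sketch below is only an illustration under stated assumptions: the lattice is the set of divisors of 12 ordered by divisibility (join is lcm, meet is gcd, bottom is 1, top is 12), and the two subsystems and the function names are hypothetical.

```python
from functools import reduce
from math import gcd

def lcm(a, b):
    """Join in the divisor lattice."""
    return a * b // gcd(a, b)

def inner(x, O):
    """i(x): join of the inner definable elements below x (y divides x)."""
    return reduce(lcm, (y for y in O if x % y == 0))

def outer(x, C):
    """c(x): meet of the outer definable elements above x (x divides y)."""
    return reduce(gcd, (y for y in C if y % x == 0))

L = [1, 2, 3, 4, 6, 12]   # divisors of 12; lattice bottom is 1, top is 12
O = {1, 2, 3, 6}          # contains the bottom, closed under lcm
C = {2, 4, 12}            # contains the top, closed under gcd

fixed_inner = {x for x in L if inner(x, O) == x}
fixed_outer = {x for x in L if outer(x, C) == x}
```

The fixed points of the two operators recover the two subsystems, and because the element 2 has no complement in this lattice, neither subsystem can be obtained from the other by complementation.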

4 Approximation Operators in Posets 

Instead of using a Boolean algebra or a lattice, one may use a poset. Such
generalizations of the Pawlak rough set model were considered by Iwinski [6], and
were recently studied systematically by Cattaneo [2]. Let $(\Sigma, \le, 0, 1)$ be a poset
with respect to a partial order relation $\le$, bounded by the least element 0 and the
greatest element 1. If the sets of inner, and respectively outer, definable elements
of $\Sigma$ are chosen to be complete lattices, one may immediately use (LC) to define
approximation operators. A similar idea is in fact used by Iwinski [6], although
the definition is different from (LC). The formulation suggested by Cattaneo [2]
is consistent with (LC).

For an arbitrary subset $X \subseteq \Sigma$, a least upper bound or a greatest lower
bound of $X$ may not exist. If a least upper bound of $X$ exists, it is unique and is
denoted by $LUB(X)$. Similarly, if a greatest lower bound of $X$ exists, it is unique
and is denoted by $GLB(X)$. Any subset $\Sigma_0$ of a poset $\Sigma$ is itself a poset under
the same order relation, and is called a subposet. In the subposet $\Sigma_0$, if the least
upper bound of a subset $X \subseteq \Sigma_0$ exists, we denote it by $LUB_{\Sigma_0}(X)$ or simply
$LUB(X)$ when $\Sigma_0$ is clear from the context. If the greatest lower bound of $X$
exists, we denote it by $GLB_{\Sigma_0}(X)$ or simply $GLB(X)$. Both Boolean algebras
and lattices can be understood as posets with additional properties. For the three
definitions in the last section, approximation operators are defined through the
LUB of a subsystem of inner definable elements, and the GLB of a subsystem of
outer definable elements. Following the same argument, we will use two subposets
$O(\Sigma)$ and $C(\Sigma)$ to represent inner and outer definable elements. But we must
require that in some sense the subposet $O(\Sigma)$ be closed with respect to LUB,
and $C(\Sigma)$ be closed with respect to GLB.

Given $x \in \Sigma$ and $Y \subseteq \Sigma$, the order ideal relative to $Y$ generated by $x$ is
defined by:

$$\downarrow x|Y = \{y \in Y \mid y \le x\}. \qquad (5)$$

Dually, the order filter relative to $Y$ generated by $x$ is defined by:

$$\uparrow x|Y = \{y \in Y \mid x \le y\}. \qquad (6)$$

With these notions, we can construct two families of subsets of $O(\Sigma)$ and $C(\Sigma)$:

$$SO(\Sigma) = \{\downarrow x|O(\Sigma) \mid x \in \Sigma\},$$

$$SC(\Sigma) = \{\uparrow x|C(\Sigma) \mid x \in \Sigma\}. \qquad (7)$$

Each element of $SO(\Sigma)$ is a subsystem of $O(\Sigma)$, and each element of $SC(\Sigma)$ is
a subsystem of $C(\Sigma)$.
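
Relative order ideals and filters are straightforward to compute once the order is given as a predicate. The following sketch is illustrative: the poset (a set of integers ordered by divisibility, chosen so that it is not a lattice), the subset and the function names are all hypothetical.

```python
def leq(a, b):
    """Partial order used in this example: a is below b when a divides b."""
    return b % a == 0

def order_ideal(x, Y):
    """Elements of Y that lie below x."""
    return {y for y in Y if leq(y, x)}

def order_filter(x, Y):
    """Elements of Y that lie above x."""
    return {y for y in Y if leq(x, y)}

# Hypothetical bounded poset (least element 1, greatest element 36); the pair
# 2, 3 has the incomparable minimal upper bounds 12 and 18, so it is no lattice.
SIGMA = [1, 2, 3, 12, 18, 36]
Y = {1, 2, 36}

# The family of all relative order ideals, one per element of the poset.
ideals = {x: order_ideal(x, Y) for x in SIGMA}
```

For instance, the ideal generated by 18 relative to this subset consists of 1 and 2, and the filter generated by 3 contains only the top element 36.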
For the definition of abstract approximation operators in posets, we adopt
and generalize the proposal of Cattaneo [2]. However, the formulation is slightly
different. In particular, we explicitly specify the structure of the set of inner
definable elements and the structure of the set of outer definable elements. Consider a
triple $(\Sigma, O(\Sigma), C(\Sigma))$ called an abstract approximation space, where $\Sigma$ is a
bounded poset, $O(\Sigma)$ is assumed to be the set of inner definable elements, and
$C(\Sigma)$ is assumed to be the set of outer definable elements. The set of inner
definable elements is characterized by the axioms:

(O1) $0 \in O(\Sigma)$, $1 \in O(\Sigma)$;

(O2*) With respect to $O(\Sigma)$, the least upper bound of $\downarrow x|O(\Sigma)$
      exists and satisfies the condition:

      $LUB_{O(\Sigma)}(\downarrow x|O(\Sigma)) \le x$, for every $x \in \Sigma$.

Axiom (O2*) suggests that $LUB_{O(\Sigma)}(\downarrow x|O(\Sigma))$ is an inner definable
element. That is, $O(\Sigma)$ is closed under LUB at least for any subsystem in $SO(\Sigma)$.
For an arbitrary subset of $O(\Sigma)$, the least upper bound may not exist. Hence
the system $O(\Sigma)$ may not be a lattice. The set of outer definable elements is
defined by the axioms:

(C1) $0 \in C(\Sigma)$, $1 \in C(\Sigma)$;

(C2*) With respect to $C(\Sigma)$, the greatest lower bound of $\uparrow x|C(\Sigma)$
      exists and satisfies the condition:

      $x \le GLB_{C(\Sigma)}(\uparrow x|C(\Sigma))$, for every $x \in \Sigma$.

The subposet $C(\Sigma)$ is closed under GLB at least for any subsystem in $SC(\Sigma)$.
It may not be a lattice. The two systems $O(\Sigma)$ and $C(\Sigma)$ are usually different
subsets of $\Sigma$. The set $O(\Sigma) \cap C(\Sigma)$ consists of those elements which are both
inner and outer definable.

An element of $\Sigma$ may be approximated by a pair of inner and outer definable
elements from $O(\Sigma)$ and $C(\Sigma)$. We define a pair of inner and outer approximation
operators, $i : \Sigma \to O(\Sigma)$ and $c : \Sigma \to C(\Sigma)$, as follows:

(PC) $i(x) = LUB(\downarrow x|O(\Sigma))$
            $= LUB(\{y \in O(\Sigma) \mid y \le x\})$,

     $c(x) = GLB(\uparrow x|C(\Sigma))$
            $= GLB(\{y \in C(\Sigma) \mid x \le y\})$,

where LUB and GLB are defined with respect to $O(\Sigma)$ and $C(\Sigma)$. By definition,
$i(x)$ is the best approximation of $x$ from below using the inner definable elements
$O(\Sigma)$, while $c(x)$ is the best approximation of $x$ from above using the outer definable
elements $C(\Sigma)$. More specifically, $i$ satisfies axioms (i0) and (i2)-(i4), and $c$
satisfies axioms (c0) and (c2)-(c4). Assume $x \le y$. We have $\downarrow x|O(\Sigma) \subseteq \downarrow y|O(\Sigma)$.
Hence, $i(x) = LUB(\downarrow x|O(\Sigma)) \le LUB(\downarrow y|O(\Sigma)) = i(y)$, namely, (i0) holds. By
(O2*), $i$ satisfies (i2). For any $y \in O(\Sigma)$, we have $y \in \downarrow y|O(\Sigma)$. This implies
$y \le i(y)$. By combining it with (i2), we have $i(y) = y$ for any $y \in O(\Sigma)$. For any
$x \in \Sigma$, $i(x) \in O(\Sigma)$. Thus, $i(i(x)) = i(x)$, namely (i3) holds. By the assumption
$1 \in O(\Sigma)$ and the definition of $i$, it follows that $i$ satisfies (i4). Similarly, we can
show that $c$ obeys (c0) and (c2)-(c4).
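
Definition (PC) can be exercised on a small bounded poset that is not a lattice. The sketch below is illustrative only: the poset (integers ordered by divisibility), the two subposets and all function names are hypothetical, and the bounds are computed inside the subposet, as the definition requires.

```python
def leq(a, b):
    """Partial order used in this example: a is below b when a divides b."""
    return b % a == 0

def lub(S, P):
    """Least upper bound of S computed inside the subposet P (None if absent)."""
    uppers = [u for u in P if all(leq(s, u) for s in S)]
    least = [u for u in uppers if all(leq(u, v) for v in uppers)]
    return least[0] if least else None

def glb(S, P):
    """Greatest lower bound of S computed inside the subposet P (None if absent)."""
    lowers = [l for l in P if all(leq(l, s) for s in S)]
    greatest = [g for g in lowers if all(leq(l, g) for l in lowers)]
    return greatest[0] if greatest else None

def inner(x, O):
    """i(x): LUB, taken in O, of the order ideal of x relative to O."""
    return lub({y for y in O if leq(y, x)}, O)

def outer(x, C):
    """c(x): GLB, taken in C, of the order filter of x relative to C."""
    return glb({y for y in C if leq(x, y)}, C)

SIGMA = [1, 2, 3, 12, 18, 36]   # bounded poset: least element 1, greatest 36
O = {1, 2, 36}                  # inner definable elements (contain both bounds)
C = {1, 12, 36}                 # outer definable elements (contain both bounds)
```

With these choices the element 18 is approximated from below by 2 and from above by 36, and both operators are idempotent and monotone on the whole poset.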

Alternatively, we may define an abstract approximation space by a triple
$(\Sigma, i, c)$, where $i$ and $c$ are mappings characterized by axioms (i0) and
(i2)-(i4), and axioms (c0) and (c2)-(c4), respectively. The sets of $i$-fixed and $c$-fixed
elements:

$$O(\Sigma) = \{x \in \Sigma \mid i(x) = x\},$$

$$C(\Sigma) = \{x \in \Sigma \mid c(x) = x\}, \qquad (8)$$

are the sets of inner and outer definable elements, respectively. By axioms (i2)
and (i4), it can be easily verified that $0, 1 \in O(\Sigma)$. Consider an element $x \in \Sigma$.
Suppose $y \in \downarrow x|O(\Sigma)$. We have $y \le x$ and $y = i(y)$. By axiom (i0), it follows that
$y = i(y) \le i(x)$. Thus, $i(x)$ is an upper bound of $\downarrow x|O(\Sigma)$. Now, we want to
show that it is in fact the least upper bound. Suppose $z$ is an upper bound of
$\downarrow x|O(\Sigma)$. We have $y \le z$ for all $y \in \downarrow x|O(\Sigma)$. By axiom (i2), $i(x) \le x$. By
axiom (i3), $i(x) \in O(\Sigma)$. Therefore, $i(x) \in \downarrow x|O(\Sigma)$. It immediately follows that
$i(x) \le z$. This implies that (O2*) holds. Similarly, one can show that $C(\Sigma)$ is the
family of outer definable elements satisfying axioms (C1) and (C2*).

In the previous formulation of approximation operators, two methods have
been used. One starts from the system $(\Sigma, O(\Sigma), C(\Sigma))$, where the two subposets
$O(\Sigma)$ and $C(\Sigma)$ are given specific structures. From this system one can define
two approximation operators, $i$ and $c$, enjoying certain properties. Dually, the
second method starts from a pair of approximation operators, namely, the
system $(\Sigma, i, c)$ satisfying some axioms, and a system $(\Sigma, O(\Sigma), C(\Sigma))$ is recovered
from the sets of fixed points of $i$ and $c$. In the formulation of Cattaneo [2], the
structures of the two subposets are stated only implicitly, through the approximation
operators. Our formulation states them explicitly. This makes our discussion
of approximation operators conform to the approaches commonly used for
the study of approximation operators in systems such as topological spaces and
closure systems.

Approximation operators in posets can be further generalized. Suppose that
the set of inner definable elements $O(\Sigma)$ satisfies the axiom:

(O1*) $0 \in O(\Sigma)$,

and axiom (O2*). The approximation operator $i$ defined by (PC) only satisfies
axioms (i0), (i2), and (i3). Similarly, if the set of outer definable elements $C(\Sigma)$
satisfies the axiom:

(C1*) $1 \in C(\Sigma)$,

and axiom (C2*), the approximation operator $c$ defined by (PC) only satisfies
axioms (c0), (c2), and (c3). They correspond to approximation operators in
closure systems. A subposet $C(\Sigma)$ is said to be a closure system if it satisfies
axioms (C1*) and (C2*). In the set-theoretic setting, a closure system must be
closed under intersection for any of its subsystems. A closure system on a poset
is closed under GLB for only certain subsystems.
Suppose the set of inner definable elements $O(\Sigma)$ satisfies the axioms (O1*)
and (O2**): for $x \in \Sigma$,

(O2**) With respect to $O(\Sigma)$, the least upper bound of $\downarrow x|O(\Sigma)$ exists,

and the set of outer definable elements $C(\Sigma)$ satisfies the axioms (C1*) and
(C2**): for $x \in \Sigma$,

(C2**) With respect to $C(\Sigma)$, the greatest lower bound of $\uparrow x|C(\Sigma)$ exists.

In this case, the approximation operator $i$ defined by (PC) satisfies axioms (i0), (i3),
and $i(0) = 0$. The approximation operator $c$ satisfies axioms (c0), (c3), and $c(1) = 1$.

A more detailed and systematic study of approximation operators in special
types of lattices and posets, as well as examples, can be found in a recent paper
by Cattaneo [2]. A different formulation of approximation operators in posets can
be found in an earlier paper by Iwinski [6].

5 Conclusion 

In generalizing Pawlak approximation operators, we have considered four sys- 
tems. From more particular instantiations to more general cases, they are Pawlak 
approximation spaces (0-dimensional topological spaces), topological Boolean al- 
gebras (topological spaces), closure systems, and abstract approximation spaces. 
For the definition of approximation operators, two subsystems, corresponding to 
the set of inner definable elements and the set of outer definable elements, are 
used. An arbitrary element is approximated from below by using inner definable
elements through LUB, and from above by using outer definable elements
through GLB. In Pawlak approximation spaces, the set of inner definable
elements is the same as the set of outer definable elements. It contains 0 and 1, and
is closed under both LUB and GLB. For topological Boolean algebras, the set
of inner definable elements is closed under LUB and finite GLB, while the set of
outer definable elements is closed under GLB and finite LUB. For closure
systems, the set of inner definable elements is only closed under LUB, and the set of
outer definable elements is only closed under GLB. For abstract approximation 
spaces, the set of inner definable elements is partially closed under LUB (i.e., 
for only some subsystems), while the set of outer definable elements is partially 
closed under GLB. The structure of the family of inner definable elements deter- 
mines the properties of inner approximation operator. Likewise, the structure of 
the family of outer definable elements determines the properties of outer approx- 
imation operator. Dually, one may start from a pair of approximation operators. 
The set of inner definable elements can be obtained by the fixed points of the 
inner approximation operator, while the set of outer definable elements can be 
obtained by the fixed points of the outer approximation operator. 
Abstract. An Intelligent Inspection Engine (IIE) for classification of
non-regular shaped objects from images is described and evaluated using
real-world data from a waste package sorting application. The entire
system is self-organizing. Principal component analysis and additional a
priori knowledge on color properties are used for feature extraction. As
classifiers, growing neural networks provide robustness and minimize the
number of runs for parameter tuning. We propose a method to encompass
feature extraction and classification within a bootstrap procedure. This
method reduces the immense memory requirement for the computation
of principal components if the number and size of training images are huge,
without too much loss of recognition quality.



1 Introduction 

Visual classifier systems have been intensely studied in the past three decades.
Many algorithms, theoretical results and systems have resulted from this research.
In particular, systems for object recognition have been proposed, e.g. [10]. Most of
these systems are designed to recognize a large number of unique objects from
different appearances while the objects themselves do not vary much. That means
the appearance of an object changes mainly because of changing illumination,
angle of view and background, but less because of changes of the object itself.
Application areas are face recognition, scene recognition and search in image
databases.

In this paper we face a more unstructured recognition task. We present a
visual classifier designed to recognize the membership of a large number of different
objects to classes. Visual object properties like shape and color within each class
vary in a wide range. There are no objects with unique appearance in a single
class. In terms of statistics we can say the density distributions of the classes have
many more modes in comparison to the tasks explained above. Applications of
this type of visual classifier are, e.g., the sorting of non-regular shaped objects like
agricultural products, waste packages and food, or the search in image databases
using concept queries. For real-world applications such as industrial automation these
systems have to be robust, fast and flexible to use. Flexibility in particular is often
very important. The user must be able to use the same system for different
visual classification tasks without many cost-intensive modifications. The system
must provide full self-organization including the preprocessing and should offer
multi-sensor fusion ability. Changes of the classification task must be manageable
by non-experts in pattern recognition. The task described is not solvable by
pattern matching or by classification from descriptive shape representations. We
will show in this paper how different classifiers compare for the real-world visual
classification task of waste package sorting. The system has been implemented on
a Parsytec MIMD parallel computer and recognizes (depending on the number
of processors) for this special application up to 10 objects per second.

2 System architecture and algorithms 

2.1 Image capture and preprocessing

The goal of preprocessing is to reduce noise and disturbances, that is, to
reduce the variance of images of the same class. The images are taken of objects
on a black conveyor belt, illuminated by a flash light because of the speed of the belt.
The orientation of the objects on the belt is not defined. To reduce the resulting high
inner class variance, a translation and rotation invariant transformation is done:
First the background is removed by thresholding. Then the object is centered
by shifting the mass point (center of gravity) of the related grey image to the
geometric center point. Thereafter the image is rotated with respect to the first
and second inertial moment of the grey image. This preprocessing
method reduces the inner class variance because objects of the same shape, which
might have different positions and orientations on the conveyor belt, get a more
similar representation in the preprocessed images.
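
The quantities that drive such a normalization can be sketched with image moments. This is a hedged illustration, not the system's actual code: it assumes the centering uses the intensity-weighted centroid and that the rotation angle is taken from the principal axis of the second-order central moments; the function name, the threshold parameter and the toy image are hypothetical.

```python
import math

def centroid_and_angle(image, threshold):
    """Return ((row, col) centroid, principal-axis angle) of a grey image.

    Pixels at or below `threshold` are treated as background. The angle is
    measured from the row axis towards the column axis, in radians.
    """
    pixels = [
        (r, c, g)
        for r, row in enumerate(image)
        for c, g in enumerate(row)
        if g > threshold              # background removal by thresholding
    ]
    mass = sum(g for _, _, g in pixels)
    if mass == 0:
        raise ValueError("no foreground pixels above the threshold")
    rc = sum(r * g for r, _, g in pixels) / mass   # first moments give the
    cc = sum(c * g for _, c, g in pixels) / mass   # center of gravity
    mu20 = sum(g * (r - rc) ** 2 for r, _, g in pixels)
    mu02 = sum(g * (c - cc) ** 2 for _, c, g in pixels)
    mu11 = sum(g * (r - rc) * (c - cc) for r, c, g in pixels)
    angle = 0.5 * math.atan2(2 * mu11, mu20 - mu02)
    return (rc, cc), angle

# Hypothetical 5 x 5 grey image with a bright diagonal bar on a dark background.
DIAGONAL = [[10 if r == c and 1 <= r <= 3 else 0 for c in range(5)] for r in range(5)]
center, theta = centroid_and_angle(DIAGONAL, threshold=5)
```

Shifting the image by the offset between this centroid and the geometric center, and then rotating it by the negative of the angle, would give the translation and rotation normalized representation described above.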



2.2 Feature Extraction 

The system should provide full self-organization. Therefore a feature
extraction method is desired which can be parameterized by Learning from Examples.
The discrete Karhunen-Loeve-Transformation (KLT) is such a method under
the assumption of a normal distribution. It became popular for image retrieval and
object recognition within the last few years because of the availability of
high-performance computers. The KLT transforms an image linearly to the orthogonal
space spanned by the eigenvectors of the estimated covariance matrix of all
sample images. Because of the high correlation of data in image regions it is only
necessary to use a space spanned by the eigenvectors of the n largest eigenvalues.

The problem with the KLT is the computation of the eigenvectors of the 
very large estimated covariance matrix. The matrix contains about 3.5 • 10^^ real 
elements for example if the training set contains RGB images of size 256 x 256. 
That is 500 G-Bytes if we consider the symmetry of the matrix and assume 4 
Bytes used to code an element of the matrix! To overcome that problem a method 
introduced in [11] has been used: If the number N of sample images is smaller 
than the image size it is only necessary to compute the implicit eigenvectors of 
the implicit covariance matrix, which has the size N x N. Thereafter the N - 1 
eigenvectors are computed as linear combinations of the sample images with the 
implicit eigenvectors as coefficients, and normalized. The idea is that there exist 
only N - 1 eigenvectors with non-zero eigenvalues. The implicit eigenvectors are 
computed using a conjugate gradient procedure. 
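
A minimal sketch of this implicit-covariance approach is given below. It assumes the sample images are flattened into the rows of a NumPy array, and it uses `numpy.linalg.eigh` in place of the conjugate gradient procedure mentioned above; the function name `klt_basis` is hypothetical.

```python
import numpy as np

def klt_basis(samples, n_components):
    """KLT basis from N samples of dimension d (N < d) via the N x N implicit matrix.

    n_components should not exceed N - 1, since only N - 1 eigenvalues are non-zero.
    """
    mean = samples.mean(axis=0)
    X = samples - mean                        # centred data, shape (N, d)
    n = X.shape[0]
    implicit_cov = X @ X.T / n                # N x N instead of d x d
    vals, vecs = np.linalg.eigh(implicit_cov) # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[order], vecs[:, order]
    basis = X.T @ vecs                        # linear combinations of the sample images
    basis /= np.linalg.norm(basis, axis=0)    # normalise each eigenvector
    return mean, basis, vals

def klt_transform(image, mean, basis):
    """Project one flattened image onto the retained eigenvectors."""
    return basis.T @ (image - mean)
```

The columns of `basis` are eigenvectors of the full d x d covariance matrix with the same eigenvalues as the implicit matrix, which is what makes the small problem sufficient.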

Still one limitation remains: to compute the implicit covariance matrix in 
an acceptable time, all the images must be in the computer's main memory. To 
overcome this, a bootstrap method explained later is used. 



2.3 The TACOMA classifier 

To reduce the engineering effort for the design of neural network architectures, a 
data-driven algorithm is desirable which constructs a network automatically during 
the training process. Especially if the neural network is part of a system 
that is to be managed by someone who is not a neural network expert, a very robust 
algorithm is necessary. Within the Intelligent Inspection Engine (IIE) the TACOMA 
growing neural network architecture is used [6]. 

The idea TACOMA is based on is to build a feed-forward neural network 
bottom-up by cyclically inserting cascaded hidden layers. The activation function 
of a hidden layer unit combines the local characteristics of radial basis function 
units (or other window functions) with sigmoid units. With each growth step 
a hidden layer consisting of such local attention neurons will be inserted. The 
number of units to be inserted and their attention regions, means and ratios are 
given by an approximation of the mapping of the residual error from output to 
input space. The attention region of a unit becomes restricted to a region in the 
input space where the residual error is still high. The attention regions of the 
hidden units of the same hidden layer do not overlap (no lateral influence) and 
can be considered as units of different subnetworks. Contrary to the Cascade- 
Correlation [3] Learning Architecture different correlation measures are used to 
train the units. The TACOMA algorithm provides good generalization properties 
as shown with different experiments [6], [7], [8], [9] . 

3 Experiments 

3.1 Data and error measures 

The prototype of the Intelligent Inspection Engine has been trained and evaluated 
with 3402 24-bit RGB images of real-world waste packages. The image 
size is 376 x 280. A training set and a test set have been generated, each containing 
1701 images and both reflecting the relative frequency of each class. For the 
bootstrap method described later, four training sets with 423 images each have 
been used. Table 1 shows the descriptions of the seven classes. The classes are 
very unevenly represented in the set, which makes the classification even harder. 

Table 1. Description of object classes and frequencies 

label  description                                                freq.       %
C1     plastic bottles white                                         72    2.12
C2     plastic bottles transparent                                  158    4.64
C3     plastic bottles colored w. stickers, printed                 346   10.17
C4     plastic or metal cans, cups white                            400   11.76
C5     plastic or metal cans, cups transparent                      254    7.47
C6     plastic or metal cans, cups colored w. stickers, printed    1420   41.74
C7     different tetra packs                                        752   22.10
sum                                                                3402  100.00

To evaluate the classifier results we define the overall error rate, labelled 
wrong in the tables, the class-specific cleanness, labelled C1 to C7, and a rejection 
rate, labelled rejected. The overall error rate is the percentage of wrongly 
classified objects within the test set (not counting rejected objects). 
An image is rejected if the two largest class probabilities output by the 
TACOMA classifier are very similar. The threshold for this and all tunable 
parameters of the TACOMA algorithm were the same for all experiments. The 
class-specific cleanness is typical for real-world sorting applications. It is defined 
as the percentage of wrongly classified objects among the objects assigned to a 
class. Imagine a sorting application where the objects fall into class boxes depending 
on the classification. Then this measure describes the percentage of wrongly classified 
objects within each of the boxes, i.e. the cleanness of the objects within a box. The 
rejection rate is the percentage of rejected objects as explained above (only for 
the TACOMA classifier). In all runs the TACOMA algorithm stops automatically 
after 20 layers have been inserted, or earlier if the training set is classified 
100% correctly. All measures within 
the tables or text are with respect to the test set. 
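
The three measures can be sketched as follows. The sketch assumes that "very similar" means the difference between the two largest class probabilities is below a threshold, and that rejected objects are excluded from both the overall error rate and the per-class counts; the function name `evaluate` and the zero-based class indices are illustrative only.

```python
import numpy as np

def evaluate(probs, labels, reject_threshold=0.1):
    """Return (overall error %, rejection rate %, per-class 'wrong' % per box)."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    top2 = np.sort(probs, axis=1)[:, -2:]               # two largest probabilities
    rejected = (top2[:, 1] - top2[:, 0]) < reject_threshold
    predicted = probs.argmax(axis=1)
    accepted = ~rejected
    wrong = accepted & (predicted != labels)
    n_accepted = accepted.sum()
    overall = 100.0 * wrong.sum() / n_accepted if n_accepted else 0.0
    per_class = {}
    for c in range(probs.shape[1]):
        in_box = accepted & (predicted == c)             # objects sorted into box c
        if in_box.sum():
            per_class[c] = 100.0 * (in_box & (labels != c)).sum() / in_box.sum()
    return overall, 100.0 * rejected.mean(), per_class
```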



3.2 Results and Comparison 

As is well known from the curse of dimensionality, the dimension of the feature space 
should be well adapted to the density of the data available for training and to the 
classifier [5], [4]. The more the dimension increases with a constant number of 
feature vectors, the less dense the data are in the input space and the greater is 
the risk of bad generalization on unseen patterns. 



The eigenvalues represent the variance of the patterns projected onto the directions 
given by the eigenvectors. That means the sum over all eigenvalues represents all 
information available for classification after the KLT. The plot of the eigenvalues 
sorted in decreasing order gives an impression of how many eigenvectors should be 
used for the transformation. Figure 1 shows the plot of up to the 30th eigenvalue. 
As one can see, the eigenvalues decrease rapidly and a quasi-linear regime (because 
of the log scale) is entered early. Nearly 90% of the variance is in the first 20 
eigenvalues. With this visualization we are able to choose roughly the number of 
eigenvectors to be used. A more detailed view is given in Figure 2 (left). Here the 
percentage of wrongly classified images for different classifiers (Cascade Correlation, 
k-nearest neighbor with k = 1, 3, 5, and TACOMA with and without rejection) is shown. 
The TACOMA classifier shows the best performance with 18 to 22 eigenvectors. 
Table 2 (right part) shows the averaged TACOMA results for 20 eigenvectors with 
rejection (the column headed T) over five runs. Especially the percentages of wrongly 
classified objects of weakly represented classes are not satisfactory. Theoretically 
the best remedy would be to add more images from those classes. But remember the 
general demands on a real-world system. Within the waste package sorting problem the 
classes are unequally represented in reality, and creating training sets where all 
classes are equally represented would add cost-intensive overhead while adapting 
the system to a new task. 

Fig. 1. Plot of the eigenvalues sorted decreasingly. 
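
The rough choice of the number of eigenvectors from the eigenvalue spectrum can be expressed as a cumulative-variance criterion. The sketch below is one possible formalization, not the procedure used in the experiments; the function name and the default fraction are assumptions.

```python
import numpy as np

def components_for_variance(eigenvalues, fraction=0.9):
    """Smallest number of leading eigenvalues whose sum reaches `fraction` of the total."""
    vals = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(vals) / vals.sum()
    count = int(np.searchsorted(cumulative, fraction)) + 1
    return min(count, len(vals))
```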



4 Enhancements Using A Priori Knowledge 

In addition to the features resulting from the KLT, a color feature vector is 
computed, because it is known a priori that color plays an important role 
in distinguishing between some classes (see Table 1). The color feature vector 
should give a compressed description of the pixel distribution of an image in RGB 
space. An object's illumination varies depending on its orientation on the 
conveyor belt. To reduce the resulting intra-class variance, only the directions 
of the RGB vectors (given for each pixel by the values of the red, green and 
blue channels) are used. The color feature vector is computed as a histogram 
of all RGB vector directions of an image. The histogram bins are given 
by prototype vectors computed by vector quantization. The vector quantization 
moves the prototype vectors iteratively to approximate the density 
of the RGB vector directions of all images in the training set. The averaged 
TACOMA results with 20 eigenvectors and rejection (0.05 and 0.1) over five 
runs are much better using this a priori knowledge, as one can see from Table 2 
(right part, columns headed TC). The percentage of wrongly classified images 
decreases from 18.59% to 11.13%. 

Fig. 2. Percentage of wrongly classified images for different classifiers using different 
numbers of eigenvectors for the KLT (left). Percentage of wrongly classified images for 
different classifiers using different numbers of eigenvectors for the KLT and an additional 
color feature vector of size 18 (right). 

Figure 2 (right) shows the results for different classifiers using one to 30 
eigenvectors and a color feature vector with 18 dimensions. Again the best results 
are with the TACOMA classifier with 18-22 eigenvectors. 
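
The color feature described in this section can be sketched as a histogram over RGB directions. The sketch assumes that the prototype vectors are already available (the vector quantization step itself is not shown), that a pixel is assigned to the prototype with the largest cosine similarity, and that black pixels, which have no direction, are skipped; the function name is hypothetical.

```python
import numpy as np

def color_direction_histogram(image, prototypes, eps=1e-12):
    """Normalized histogram of per-pixel RGB directions over prototype directions."""
    pixels = np.asarray(image, dtype=float).reshape(-1, 3)
    norms = np.linalg.norm(pixels, axis=1)
    keep = norms > eps                                   # skip black pixels
    directions = pixels[keep] / norms[keep, None]        # unit RGB vectors
    protos = np.asarray(prototypes, dtype=float)
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    nearest = np.argmax(directions @ protos.T, axis=1)   # closest prototype direction
    hist = np.bincount(nearest, minlength=len(protos)).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Because only directions are used, a bright and a dark pixel of the same hue fall into the same bin, which is the illumination invariance the text aims for.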

5 Using bootstrapping to reduce the memory requirements 

5.1 The bootstrap algorithm 

Bootstrapping is well known as a method to manage the bias-variance dilemma 
[5], [1], [2] and offers a simple framework to improve the generalization ability 
of classifier systems. The idea is to use a number of density estimators instead 
of one and to average the class probabilities of the estimators as input to the 
decision function. For a theoretical description of bootstrapping see [5]. 

Here we show the use of bootstrapping to reduce the immense memory 
requirements for computing the principal components of huge sets of images. As 
explained above, the method of [11] computes the eigenvectors and eigenvalues using 
the implicit covariance matrix. To compute this matrix within an acceptable 
time, all images of the training set have to be held within the computer memory. 
Additionally, the size of the implicit covariance matrix increases quadratically 
with the number of images. Assuming 3000 24-bit RGB images of size 256 x 256, 
562.5 MByte for the images and 17.2 MByte for the matrix are required. The 
simple concept to reduce the memory requirements is to include the 
computation of the KLT within the bootstrap procedure. Instead of computing one 
transformation matrix from all images, compute a number of matrices and train 
a classifier for each of them using a subset of the images. The only drawback is 
the increased time for the classification of an object. Now as many vector-matrix 
products and neural network output vectors have to be computed as there are 
subsets used for training, instead of one. But using a parallel computer these jobs can be 
well distributed over a number of nodes because no communication is necessary. 
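
The memory figures for the 3000-image example and the averaging of class probabilities over several independently trained estimators can be sketched as follows. The sketch assumes 3 bytes per pixel, 4 bytes per matrix element with only half of the symmetric matrix stored, and MByte = 2^20 bytes; each estimator is modelled as a hypothetical callable that applies its own KLT matrix and classifier and returns a vector of class probabilities.

```python
import numpy as np

MBYTE = 2 ** 20

def memory_mbyte(n_images, height, width, bytes_per_element=4):
    """Memory for the raw RGB images and for half of the symmetric N x N matrix."""
    images = n_images * height * width * 3 / MBYTE
    matrix = n_images * n_images / 2 * bytes_per_element / MBYTE
    return images, matrix

def bootstrap_predict(estimators, x):
    """Average the class probabilities of all estimators and pick the largest."""
    probs = np.mean([estimator(x) for estimator in estimators], axis=0)
    return int(np.argmax(probs)), probs
```

For example, `memory_mbyte(3000, 256, 256)` gives 562.5 MByte for the images and about 17.2 MByte for the matrix, matching the figures quoted above.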



5.2 Bootstrap results 

Four subsets have been used, each containing 423 images. Instead of 128.1 MByte 
for the 1701 images and 5.5 MByte for the matrix, only 31.85 MByte and 0.7 
MByte are needed for each subset of 423 images. Table 2 shows the results 
achieved by the single networks (columns headed s1 to s4) and by the bootstrapped 
networks (column headed Boot) for a typical run. The bootstrap results are better 
than those of the single classifiers for the overall error rate and for most of 
the class-specific rates. 



Table 2. Bootstrapping using multiple matrices, results of a typical run, rejection 
threshold = 0.1 (left part). Summary of results for 20 eigenvectors. The columns contain 
the average rates over 5 runs; the second header line gives the rejection threshold (right part) 

                  Boot      s1      s2      s3      s4 |       T      TC      TC     TCB     TCB
                                                       |    0.05    0.05     0.1    0.05     0.1
wrong (%)        10.47   17.35   15.72   17.07   15.84 |   18.59   11.13   10.10   12.49   10.99
rejected (%)      8.47    8.88    9.52    9.41    8.70 |    3.88    2.83    5.26    4.08     7.4
C1 wrong (%)      8.33   57.89   50.00   40.00   37.93 |   41.86   34.00   31.43   21.11   13.36
C2 wrong (%)     10.20   35.29   22.45   21.88   31.67 |   26.92   18.56   18.44   11.64   10.35
C3 wrong (%)     15.38   26.53   21.68   31.94   13.67 |   20.12   13.40   11.44   13.56   11.47
C4 wrong (%)     10.49   27.69   27.57   10.14   18.18 |   19.53   15.33   13.71   13.68   12.58
C5 wrong (%)     22.64   29.73   23.81   34.48   28.71 |   29.70   22.01   21.18   23.49   21.97
C6 wrong (%)      9.46   12.37   10.93   12.88   14.84 |   17.04    8.23    7.42   11.62   10.24
C7 wrong (%)      6.87    8.71   10.17   11.27    9.36 |   13.77    6.07    5.40     9.5    8.44


The averaged results of five runs are shown in Table 2 (right part, columns headed 
TCB) for rejection thresholds of 0.05 and 0.1. The overall error rates are about 
one percentage point above the rates of the TC runs. The differences are not large and 
might be smaller if the classes were more equally distributed. In that case one 
should use more subsets. Using the bootstrap method therefore seems to be a good 
compromise. 

6 Summary and Conclusions 

Classification of non-regularly shaped objects from images is becoming more and more 
interesting for real-world industrial applications. We proposed an Intelligent 
Inspection Engine able to solve the very hard task of waste package sorting with 
good results. The system fulfills the demands for self-organization, flexibility, 
robustness and manageability by people who are not pattern recognition experts. The results 
demonstrate that the techniques underlying KLT, TACOMA and bootstrapping 
are general. Especially the extension of bootstrapping over the feature extraction 
by principal component analysis is a smart method to decrease memory 
requirements, which is important for practical applications (system costs). The scalable parallel 
implementation of the system guarantees the fulfillment of real-time demands for 
a wide range of applications. This has led to the development of a software 
package called Intelligent Inspection Engine for the classification of non-regularly shaped 
objects from images. The image data set is available via FTP on request. Send 
an E-Mail to voigt@gfai.de. 

Abstract. The convergence of the fractal operator F used in image compression 
is investigated. A sufficient condition for eventual contractivity 
is derived by using the adjacency matrix of an influence graph which is 
determined by the fractal encoder. 



1 Introduction 

In the fractal image compression scheme, the original image $f$ is decoded as the fixed 
point of a fractal operator $F$ which has been built for $f$ in the encoding stage 
(see Jacquin [5]). This operator works in a space $\mathcal{I}$ of real-valued images 
defined on a fixed discrete domain $D$: 

$$\mathcal{I} = \{ g \mid g : D \to \mathbb{R} \}.$$

Using the classical contractive mapping theorem of Banach (see Banach [1], 
Dugundji and Granas [3]), the fixed point is approximated by successive 
iterations $F^{\circ k}(x_0)$ of the contractive operator $F$ (see Barnsley and Hurd [2], 
Fisher [4]). However, practical use of this scheme has encountered several difficulties, 
as for most existing encoding schemes the essential condition of contractivity for 
$F$ cannot be guaranteed. 

It was observed that if $F$ is not contractive then a certain iteration $F^{\circ k}$ may 
be contractive. This situation is known as eventual contractivity. In this 
paper, a new sufficient condition for eventual contractivity is given, using 
algebraic operations on the adjacency matrix of an influence graph defined between 
image domains, i.e. between parts of the given image. 



2 Fractal operators 

Let $f \in \mathcal{I}$. We assume that its domain $D = \mathrm{dom}(f)$ is a rectangular array of 
$N = |D|$ pixels. Two kinds of subdomains are considered in the definition of a 
fractal operator $F$: 

— target domains (t-Domains for short), which form a partition $\Pi$ of $D$, i.e.: 

$$D = \bigcup_{i=1}^{a} T_i, \qquad T_i \cap T_j = \emptyset, \quad i \neq j;$$

— source domains (s-Domains for short), which form a cover $\Gamma$ of $D$, i.e.: 

$$D = \bigcup_{i=1}^{b} S_i.$$

The work was sponsored in part by ECU grant CRIT2. 

We say that the pair $(\Pi, \Gamma)$ is a regular pair of domain sets if the following 
conditions are satisfied: 

1. each t-Domain $T_i \in \Pi$ is a discrete square of size $|T_i| = 2^{t_i} \times 2^{t_i}$ pixels, 
where typically $t_i = 2, 3, 4$; 

2. each s-Domain $S_i \in \Gamma$ is a discrete square of size $|S_i| = 2^{s_i} \times 2^{s_i}$ pixels, 
where typically $s_i = 3, 4, 5$; 

3. each s-Domain $S_i \in \Gamma$ is a union of certain t-Domains. 

Let $U$ be a domain from $\Pi \cup \Gamma$. Then by $\chi_U$ we mean the characteristic function 
of $U$, i.e. for any pixel $p \in D$: 

$$\chi_U(p) = \begin{cases} 1 & \text{if } p \in U, \\ 0 & \text{otherwise}. \end{cases}$$

We model the subimage of the image $g$ restricted to a domain $U \in \Pi \cup \Gamma$ 
by $g_U = g \chi_U$. Note that 

$$g = \sum_{T \in \Pi} g_T \quad \text{for any } g \in \mathcal{I}. \qquad (1)$$

The main idea in the construction of the fractal operator $F$ for the given 
image $f$ is the elaboration of a matching function $\mu : \Pi \to \Gamma$ such that for any 
$T \in \Pi$ the subimage $f_{\mu(T)}$ is the most similar to the subimage $f_T$ and the size 
of $\mu(T)$ is greater than the size of $T$. 

In the simple but practical case, the similarity is defined by affine mappings acting 
separately between the domains of the subimages and between the ranges of the subimages, i.e. 
between gray scale intervals. 

The degree of similarity between subimages is measured by a p-norm $\|\cdot\|_p$, 
$1 \le p \le \infty$. 

Having $\Pi$ and $\mu$, the fractal operator is defined for any $g \in \mathcal{I}$ as follows: 

$$F(g) = \sum_{T \in \Pi} \left[ c_T \cdot R_T(g) + o_T \cdot \chi_T \right] \qquad (2)$$

where 

— $c_T$ is the contrast between the subimages $g_{\mu(T)}$ and $g_T$; 

— $o_T$ is the offset between the subimages $g_{\mu(T)}$ and $g_T$; 

— $R_T = P_T \circ A_T$ is the reducing mapping; 

— $A_T$ is the averaging mapping, effectively reducing the subimage $g_{\mu(T)}$ by 
averaging pixels from a subsquare in the bigger s-Domain $\mu(T)$ and putting the 
result into the corresponding pixel of the smaller t-Domain $T$. For the regular 
pair $(\Pi, \Gamma)$ the mapping $A_T = (1/4)^k A'_T$, where: 

• $T$ is of size $2^t \times 2^t$, $\mu(T)$ is of size $2^s \times 2^s$, and $k = s - t$; 

• for a certain numbering of pixels in $D$ the mapping $A'_T$ can be written in 
matrix notation, with a matrix which has: 

* $N = |D|$ rows and $N$ columns; 

* exactly $4^k$ ones in rows corresponding to elements from $T$; 

* only zeros in rows corresponding to elements outside of $T$; 

* only a single one in columns corresponding to elements from $\mu(T)$; 

* only zeros in columns corresponding to elements outside of $\mu(T)$; 

— $P_T$ is the affine permutation mapping on $T$, i.e. with single ones in rows and 
columns corresponding to elements from $T$. 



3 Standard results on convergence of fractal operators 

If we take the matrix form of the mappings $R_T$ and of the characteristic functions 
$\chi_T$ ($T \in \Pi$) then any fractal operator $F$ can be written in a vector form: 

$$F g = L g + o \qquad (3)$$

where $g$, $o$ are vectors and $L$ is an $N \times N$ matrix: 

$$L = \sum_{T \in \Pi} c_T \cdot R_T, \qquad o = \sum_{T \in \Pi} o_T \cdot \chi_T \qquad (4)$$

Using matrix notation we can also represent the iterations of $F$ by matrix 
powers of $L$: 

Lemma 1. 

For any natural $k > 0$: 

$$F^{\circ k}(g) = L^k g + \sum_{i=0}^{k-1} L^i o \qquad (5)$$

Obviously $F$ ($F^{\circ k}$) is Lipschitzian with factor $\|F\|_p$ ($\|F^{\circ k}\|_p$) equal to the 
p-norm of the matrix $L$ ($L^k$). Hence we can easily prove that 

Lemma 2. 

1. $F$ is contractive in p-norm if and only if $\|L\|_p < 1$; 

2. $F$ is eventually contractive in p-norm, i.e. there exists $k$ such that the 
operator $F^{\circ k}$ is contractive, if and only if there exists $k$ such that $\|L^k\|_p < 1$. 
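
Equations (3) and (5) can be checked numerically on a small example. In the sketch below the matrix `L` and offset vector `o` are arbitrary placeholders rather than the output of a real fractal encoder; the function names are illustrative.

```python
import numpy as np

def iterate(L, o, g, k):
    """Apply F(g) = L g + o to g, k times."""
    for _ in range(k):
        g = L @ g + o
    return g

def closed_form(L, o, g, k):
    """Right-hand side of equation (5): L^k g + sum_{i=0}^{k-1} L^i o."""
    acc = np.zeros(len(o), dtype=float)
    power = np.eye(L.shape[0])          # holds L^i, starting with L^0
    for _ in range(k):
        acc += power @ o
        power = L @ power
    return power @ g + acc              # power now equals L^k
```

For a matrix with supremum norm below one (maximum absolute row sum, `np.linalg.norm(L, np.inf)`), repeated application of `iterate` approaches the solution of `(I - L) g = o`, which is the fixed point guaranteed by the contraction argument.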
From the properties of the mapping $A_T$ it follows immediately that for the regular 
pair $(\Pi, \Gamma)$, in the supremum norm $\|\cdot\|_\infty$, its operator norm is $\|A_T\|_\infty = 1$. Applying 
the permutation $P_T$ from the left to $A_T$ results in a permutation of its rows 
in the matrix representation. The interchange of rows does not change the supremum 
norm. Therefore the supremum norm of $R_T$ is equal to one too. Hence the 
supremum norm of $L$ can be easily derived: 

Lemma 3. 

If the pair of domain sets $(\Pi, \Gamma)$ is regular then 

$$\|L\|_\infty = c^* \doteq \max_{T \in \Pi} |c_T| \qquad (6)$$

From Lemmas 3 and 2, we get: 

Corollary 1. 

If $c^* < 1$ then the fractal operator $F$ is contractive in the supremum norm. 

For finite dimensional vector spaces all p-norms are equivalent. Therefore 
the condition $c^* < 1$ is also sufficient for the convergence of the fractal operator 
iterations in any p-norm: 

Theorem 1. 

Let the fractal operator $F$ be given by equation (2). If $c^* < 1$ then 

1. there exists a unique fixed point $f$ of the operator $F$; 

2. for any initial image $g_0 \in \mathcal{I}$ and for any p-norm ($1 \le p \le \infty$): 

$$\lim_{k \to \infty} \|F^{\circ k}(g_0) - f\|_p = 0.$$

4 Influence graph for fractal operator 

Assuming that $(\Pi, \Gamma)$ is regular, we are going to generalize the above results to 
eventually contractive fractal operators. 

We say that a t-Domain $T_i$ is influenced by a t-Domain $T_j$ if $T_j$ is included 
in the most similar s-Domain $\mu(T_i)$, i.e.: 

$$T_i \leftarrow T_j \iff T_j \subset \mu(T_i) \qquad (7)$$

The influence graph $G = (V, E)$ can be easily defined using the relation $\leftarrow$. 
Namely, the set of vertices $V$ is identified with the integer labels of t-Domains, i.e. 
$V \doteq \{1, \ldots, a\}$, while the set of directed edges $E$ is specified as follows: 

$$E \doteq \{(j, i) : T_i \leftarrow T_j\}.$$

Note that the influence graph is completely determined by the matching 
function $\mu$ and therefore it can be built in the fractal image encoding stage. 

The following lemma explains the significance of the influence graph G : 







W. Skarbek 



Lemma 4. 

1 . 

if 7^ 0 then (j, i) G ^ ; 

a 

E 

^=1 3-{j,i)^E 

3. 

a 

E=Yi E ■ ■ ■ E ^'^^2 • • • rt,, rt,, . . . rt,, . 

:/'i=i j2-(h,ji)^E jk-Ukdk-i)^E 



For the influence graph $G$, a weighted adjacency matrix $C = [c_{ij}]$ is defined 
as follows: 




By $\max(M)$ we denote a maximum element of the matrix $M$. Let $\otimes$ stand 
for the composition of matrices in which the operation $+$ is replaced by the max 
operation. $M^{*k}$ denotes $k - 1$ compositions of type $\otimes$: 

$$M^{*2} = M \otimes M, \qquad M^{*(k+1)} = M \otimes M^{*k}.$$

We can prove by mathematical induction the following lemma, which gives 
an upper bound for the supremum operator norm of powers of $L$: 

Lemma 5. For any $k \ge 1$: $\|L^k\|_\infty \le \max(C^{*k})$. 

Using Lemmas 5 and 2 we give a sufficient condition for convergence 
which is based on eventual contractivity: 

Theorem 2. 

Let the fractal operator $F$ be given by equation (2). If there exists $k > 0$ 
such that $\max(C^{*k}) < 1$ then 

1. there exists a unique fixed point $f$ of the operator $F$; 

2. for any initial image $g_0$ and for any p-norm ($1 \le p \le \infty$): 

$$\lim_{l \to \infty} \|F^{\circ l}(g_0) - f\|_p = 0.$$
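
The condition of Theorem 2 can be tested mechanically once the weighted adjacency matrix $C$ is available. The sketch below takes $C$ as a given non-negative NumPy array (its construction from the encoder output is not shown) and implements the composition in which the sum of the ordinary matrix product is replaced by a maximum; the function names and the search limit `max_k` are assumptions.

```python
import numpy as np

def max_times(A, B):
    """(A (x) B)[i, j] = max over m of A[i, m] * B[m, j]."""
    return np.max(A[:, :, None] * B[None, :, :], axis=1)

def eventual_contractivity_order(C, max_k=50):
    """Smallest k <= max_k with max(C^{*k}) < 1, or None if none is found."""
    power = np.array(C, dtype=float)        # C^{*1} = C
    for k in range(1, max_k + 1):
        if power.max() < 1.0:
            return k
        power = max_times(C, power)         # C^{*(k+1)} = C (x) C^{*k}
    return None
```

A return value of 1 corresponds to the ordinary contractive case; a larger value indicates that only a higher iterate of the operator is certified as contractive by this bound.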

Abstract. A new methodology for pattern recognition is presented which 
is based on the design of invariant reference points. It is shown that the k-NN 
distance classifier is a special case of this methodology. New classifiers 
within this framework are also described. 



1 Introduction 

Pattern recognition is an area of artificial intelligence concerned with the synthesis and 
analysis of procedures for classifying objects on the basis of their physical 
measurements (for instance visual images). In the case of images, the object recognition 
process consists of three main stages: 

1. object localization in the image (segmentation stage); 

2. feature vector extraction (measurement stage); 

3. object classification (decision stage). 

For instance, in zip code recognition the zip area is detected, the digits are 
separated, a feature vector is extracted for each digit, and finally the feature vectors are 
classified. 

The recognition rate in real-life systems never attains 100%. The basic reason 
lies in the measurement process, which can give, for two objects from different 
classes, two equal or very close measurement vectors. A good recognition system 
reduces the probability of such events. 

This work concerns a new classification methodology, called here the IRP 
method (Invariant Reference Points). 

The IRP method, as a methodology capable of generating a number of 
classifiers, places no restrictions on segmentation methods and feature extraction 
methods. It is based on the construction, for the given class, of a pair of type $(x_i, F_i)$ = 
(the reference point from the space of measurements, the operator in the space 
of measurements) such that $x_i$ is an invariant point of $F_i$, i.e. $F_i(x_i) = x_i$. 

The classification of the vector $y$ is performed on the basis of all distance 
values $\|F_i(y) - y\|$ computed for all reference points which are designed for 
all classes for the given recognition problem. 

It is interesting that the IRP method not only covers several known classifiers 
but also leads to new classification schemes with a very simple feature extraction 
step. 

2 Problem of classifying measurement vectors 

In any classification problem, we deal with a distinguished object class $\Omega$, which 
is subdivided into nonintersecting subclasses $\Omega_i$, $i = 1, \ldots, n$: 

$$\Omega = \Omega_1 \cup \ldots \cup \Omega_n, \qquad \Omega_i \cap \Omega_j = \emptyset, \quad i \neq j.$$

For each object $\omega \in \Omega$ we can measure its $N$-dimensional feature vector $x = 
X(\omega)$. The classifier, using a particular vector $X(\omega) \in \mathbb{R}^N$, elaborates a decision 
$\delta(X(\omega)) = i$ on the membership of the object $\omega$ in the class $\Omega_i$: 

$$\delta : \mathbb{R}^N \to \{1, \ldots, n\}.$$

If $\omega \in \Omega_i$ but $\delta(X(\omega)) \neq i$ then the decision is wrong. An optimal decision 
procedure $\delta$ gives the minimum for the recognition error. 

Assume that the probability of each class $P(\Omega_i) = \mathrm{Prob}(\omega \in \Omega_i)$ and the 
probability distribution density in each class $p_i(x) = \mathrm{Prob}(X(\omega) = x \mid \omega \in \Omega_i)$ 
are known. Then it can be proved that the following decision function: 

$$\delta(x) = \arg\max_{1 \le i \le n} p_i(x) P(\Omega_i),$$

is optimal, i.e. has the minimum of wrong decisions. It is equivalent to the so 
called maximum likelihood rule: given the measurement $x = X(\omega)$, choose the 
class index $i$ for which the likelihood defined by the formula: 

$$\log \mathrm{Prob}(x = X(\omega) \wedge \omega \in \Omega_i),$$

is maximal. The classifier is rather of theoretical significance, as practical 
estimates of the probability distributions are not known for large $N$ and a relatively 
sparse training set with cardinality $L$. The only exception is a flat Gaussian 
distribution, when the data is concentrated around a $K$-dimensional subspace with 
$K \ll N$. It means that a search for new effective classifiers for a particular 
application is still important. 

Certain classification problems require more than one feature vector for the 
given object (e.g. face or fingerprint recognition). In such cases we can say that a 
generalized classifier operating on $k$ measurement vectors $x_1 = X(\omega_1), \ldots, x_k = 
X(\omega_k)$ for certain objects $\omega_1, \ldots, \omega_k$ which belong to the same class $\Omega_j$ elaborates 
a decision $\nu(x_1, \ldots, x_k) = i$ about the membership of this sequence in the class 
$\Omega_i$: 

$$\nu : \underbrace{\mathbb{R}^N \times \ldots \times \mathbb{R}^N}_{k} \to \{0, 1, \ldots, n\},$$

where the symbol 0 corresponds to the no decision category. Using a majority rule 
for the results of a single vector classifier we can build a generalized classifier for 
$k > 1$ feature vectors: 

$$\nu(x_1, \ldots, x_k) = \mathrm{majority}(\delta(x_1), \ldots, \delta(x_k)).$$

The majority function can be defined in many ways. We consider here two 
definitions: 

$$\mathrm{majority}_1(i_1, \ldots, i_k) = \arg\max_{1 \le j \le n} |\{a : i_a = j\}| \qquad (1)$$
majorityi{ii, ...,ik)= arg max |{a : ia = j}\ (1) 
At ties, the ambiguity is resolved by a random choice of the index $j$. 

$$\mathrm{majority}_2(i_1, \ldots, i_k) = j \quad \text{if } |\{a : i_a = j\}| > \frac{k}{2} \qquad (2)$$

If such $j$ is missing, the result is zero. 

3 Invariant reference points 

Let us assume that the learning set $\mathcal{X}_i \subset \mathbb{R}^N$, $i = 1, \ldots, n$, consists of 
measurement vectors obtained for objects $\omega \in \Omega_i$. 

The IRP method is based on the construction of an IRP collection $Z_i$, $i = 
1, \ldots, n$, which is built using information included in the learning set $\mathcal{X}_i$. $Z_i$ 
consists of a certain number $k_i$ of pairs $(x_j^{(i)}, F_j^{(i)})$, where $x_j^{(i)}$ is the $j$-th reference 
point in the collection while $F_j^{(i)}$ is an operator in the measurement space, 
$F_j^{(i)} : \mathbb{R}^N \to \mathbb{R}^N$ (we also call it an eigenoperator), such that $x_j^{(i)}$ is its invariant point: 

$$F_j^{(i)}(x_j^{(i)}) = x_j^{(i)}, \quad j = 1, \ldots, k_i, \quad i = 1, \ldots, n.$$

Let $\mathcal{R}_i = \{x_1^{(i)}, \ldots, x_{k_i}^{(i)}\}$ be the set of reference points for the class $\Omega_i$ and 
$\mathcal{F}_i = \{F_1^{(i)}, \ldots, F_{k_i}^{(i)}\}$ be the set of eigenoperators for the class $\Omega_i$. 

Roughly, the IRP collection $Z_i$ approximates the learning set $\mathcal{X}_i$ with 
reference points. In the proximity of their invariant points, eigenoperators have low 
dynamics. This property is used in the classifier. We define here three general 
schemes [8,9] for constructing the reference sets $\mathcal{R}_i$: 

1. Centroid method: we define only one reference point in the class ($k_i = 1$), 
which is the centroid of the learning set $\mathcal{X}_i$: 

$$x_1^{(i)} = \frac{1}{|\mathcal{X}_i|} \sum_{y \in \mathcal{X}_i} y, \qquad \mathcal{R}_i = \{x_1^{(i)}\};$$

2. Clustering method: the algorithm finds a set of reference points $\mathcal{R}_i = 
\{x_1^{(i)}, \ldots, x_{k_i}^{(i)}\}$ optimizing a certain cost function, e.g.: 

$$\mathrm{cost}(\mathcal{R}_i) = \alpha\, \mathrm{MSE}(\mathcal{R}_i, \mathcal{X}_i) + \beta k_i \qquad (3)$$

where $\alpha, \beta$ are weights, 

$$\mathrm{MSE}(\mathcal{R}_i, \mathcal{X}_i) = \frac{1}{|\mathcal{X}_i|} \sum_{y \in \mathcal{X}_i} \|y - x_{j(y)}^{(i)}\|^2, \qquad j(y) = \arg\min_{1 \le j \le k_i} \|y - x_j^{(i)}\|.$$

Let us notice that the cost function (3) makes possible an adequate choice of the 
number of reference points; 

3. Learning set technique: here the set of reference points is equal to the 
learning set, i.e. $\mathcal{R}_i = \mathcal{X}_i$. This technique has a special meaning when 
the class is represented by only several measurement vectors (for instance in face 
recognition we usually have only a few face poses of the given person). 
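
The centroid method and the cost function (3) can be sketched as follows. The sketch assumes that MSE denotes the mean squared distance from each learning vector to its nearest reference point; the function names and the default weights are illustrative, and the clustering algorithm that would minimize the cost is not shown.

```python
import numpy as np

def centroid_reference(learning_set):
    """Centroid method: one reference point, the mean of the learning set."""
    return np.asarray(learning_set, dtype=float).mean(axis=0, keepdims=True)

def clustering_cost(reference_points, learning_set, alpha=1.0, beta=0.1):
    """Cost (3): alpha * MSE + beta * (number of reference points)."""
    refs = np.asarray(reference_points, dtype=float)
    data = np.asarray(learning_set, dtype=float)
    # squared distance from every learning vector to every reference point
    d2 = ((data[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
    mse = d2.min(axis=1).mean()          # each vector uses its nearest reference
    return alpha * mse + beta * len(refs)
```

Comparing the cost of candidate reference sets of different sizes shows how the second term penalizes additional reference points while the first rewards a closer fit.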



4 Design of eigenoperators 



The design of the eigenoperator $F_j^{(i)}$ such that $F_j^{(i)}(x_j^{(i)}) = x_j^{(i)}$ is generally a hard 
problem if the only criterion is the recognition error for the resulting classifier. 
We present here three universal design techniques: 

1. Constant operator technique: for each reference point $x_j^{(i)}$ the operator 
$F_j^{(i)}$ is defined as follows: 

$$F_j^{(i)}(x) = x_j^{(i)} \quad \text{for each } x \in \mathbb{R}^N.$$

It will be shown how this technique, combined with a clustering method for 
reference point design, leads to the k-NN method. 

2. Technique of projection onto principal subspace: let $V_j^{(i)} = \{y \in 
\mathcal{X}_i : j(y) = j\}$ be the set of learning vectors which are closer to $x_j^{(i)}$ than 
to any other reference point in $\mathcal{R}_i$. If the reference points are found by the 
centroid algorithm then $x_j^{(i)}$ is the centroid of the set $V_j^{(i)}$. 
Assuming an adequate cardinality of this set we can implement the principal 
component analysis (PCA, [2,6]) with $K$ components. PCA gives $K$ principal 
vectors which are placed into the columns of the matrix $W_j^{(i)}$. Then the 
eigenoperator can be defined by the following formula: 

$$F_j^{(i)}(x) = x_j^{(i)} + W_j^{(i)} \left(W_j^{(i)}\right)^T \left(x - x_j^{(i)}\right) \quad \text{for any } x \in \mathbb{R}^N.$$

It is obvious that $x_j^{(i)}$ is an invariant point of $F_j^{(i)}$, i.e. $F_j^{(i)}(x_j^{(i)}) = x_j^{(i)}$. 
We should emphasize that the above design makes practical sense for large 
learning sets. 

3. Fractal operator technique: the algorithm performs an extensive search 
of local mappings of fragments of the vector $x = x_j^{(i)}$ of cardinality $pk$, $p > 1$ 
(so called source domains) to fragments of cardinality $k$ (so called target 
domains) [4,7]. 

Let us assume that $k$ divides $N$ ($N = kl$), and let $\Pi = \{T_1, \ldots, T_l\}$ be the 
set of target domains, i.e. a partition of the index set $\{1, \ldots, N\}$ into $l$ subsets 
of cardinality $k$. Let $\Gamma = \{S_1, \ldots, S_l\}$ be the set of the source domains which 
are best matched to the corresponding target domains, $|S_a| = pk$. Matching here 
denotes an optimal local affine mapping $L_a$ which reduces the fragment 
$x|_{S_a}$ to $x|_{T_a}$, where $\cdot|_{T_a}$ denotes a restriction to the domain $T_a$. Then the 
fractal operator, i.e. the eigenoperator for the reference point $x_j^{(i)}$, is 
defined as follows: 

$$F_j^{(i)}(y) = \sum_{a=1}^{l} L_a(y).$$
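
The first two design techniques can be sketched as small operator factories. The constant operator follows the definition directly. For the projection technique the sketch assumes the affine form `F(x) = ref + W W^T (x - ref)` with the principal vectors obtained from a singular value decomposition of the cell vectors centred at the reference point; this is one common way to realize a projection onto a principal subspace and is an assumption here, as are the function names.

```python
import numpy as np

def constant_operator(ref):
    """Constant operator technique: F(x) = ref for every x."""
    ref = np.asarray(ref, dtype=float)
    return lambda x: ref.copy()

def pca_operator(ref, cell_vectors, n_components):
    """Projection onto the principal subspace through the reference point."""
    ref = np.asarray(ref, dtype=float)
    centred = np.asarray(cell_vectors, dtype=float) - ref
    # rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    W = vt[:n_components].T                      # N x K matrix of principal vectors
    return lambda x: ref + W @ (W.T @ (np.asarray(x, dtype=float) - ref))
```

Both operators leave the reference point fixed, and the distortion `||x - F(x)||` is small for vectors lying close to the reference point or, in the second case, close to the principal subspace.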

5 Class selection techniques 

In the IRP method, for the classification of a measurement vector $x$ we consider 
a sequence of distance values $D_i(x) = (\|x - F_1^{(i)}(x)\|, \ldots, \|x - F_{k_i}^{(i)}(x)\|)$, $i = 
1, \ldots, n$. The classifier uses a selection technique $S_a$ which, for the given $x$, on the 
basis of $D_i(x)$ returns the id of the class $x$ probably belongs to. 

Let $D(x) = (D_1(x), \ldots, D_n(x))$. There exist many possibilities to aggregate 
the information about distances and then select the class. A few of them are defined 
below: 

1. Minimum distortion technique: 

$$S_1 = \arg\min_{1 \le i \le n} \min(D_i(x)).$$

The class for which the minimum distortion $\|x - F_j^{(i)}(x)\|$ is achieved is chosen; 

2. Minimum average distortion technique: 

$$S_2 = \arg\min_{1 \le i \le n} \mathrm{avg}(D_i(x)),$$

where $\mathrm{avg}(D_i)$ is the average value of the sequence $D_i$. Here the average 
distortion introduced by the eigenoperators in the given class decides about the 
choice of the given class. 

3. Basis functions technique: 



^3 



n ki 

0.5 + 

i=i j=i 



where 









X] = e 



X G R 



N 



is a basis function concentrated in the point x^^^ [5]. The coefficients cij can 
be found using recursive least square method for the learning data sequence 

4. k least distortions technique (k-NZ): Let $I_k(x) = (i_1, \dots, i_k)$ be the
sequence of class ids for which we have found the $k$ least distortions in the
sequence $\|x - F_j^{(i)}(x)\|$. Then:

$$S_4 = \arg\max_{j} |\{a : i_a = j\}|,$$

where $i_a$ is the $a$-th coordinate of the sequence $I_k(x)$. In this approach we choose
the class id which occurs most frequently among the $k$ least distortions.
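
The selection techniques $S_1$, $S_2$ and $S_4$ can be sketched in a few lines. This is only an illustration, assuming the distortions are already available as a list `D` in which `D[i]` holds the distances for class `i + 1`; the function names are hypothetical and do not come from the paper.

```python
from collections import Counter

def select_min(D):
    """S_1: class whose smallest distortion is the overall minimum."""
    return min(range(len(D)), key=lambda i: min(D[i])) + 1

def select_min_avg(D):
    """S_2: class with the smallest average distortion."""
    return min(range(len(D)), key=lambda i: sum(D[i]) / len(D[i])) + 1

def select_k_least(D, k):
    """S_4: most frequent class id among the k least distortions."""
    # Flatten to (distortion, class id) pairs and sort by distortion.
    pairs = sorted((d, i + 1) for i, Di in enumerate(D) for d in Di)
    ids = [cls for _, cls in pairs[:k]]
    return Counter(ids).most_common(1)[0][0]

# Two classes: class 1 has one very close operator, class 2 is close on average.
D = [[0.9, 0.1, 0.8], [0.3, 0.35, 0.4]]
print(select_min(D), select_min_avg(D), select_k_least(D, 3))
```

The example shows that the techniques can disagree: a single small distortion decides $S_1$, while $S_2$ and $S_4$ favour the class whose operators are consistently close.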



6 The IRP method definition 



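Let $X^{(i)} \subset R^N$ be the learning set for the $i$-th class, $i = 1, \dots, n$. Suppose
that for these learning sets we can build IRP ensembles $Z_i$, each consisting of $k_i$
reference points $x_j^{(i)}$ together with their eigenoperators $F_j^{(i)}$, $j = 1, \dots, k_i$.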



Let $x \in R^N$ be a testing measurement vector. Then the classification procedure
in the IRP method is of the form:

1. Compute the sequences $D_i(x) = (\|x - F_j^{(i)}(x)\|)$, $1 \le j \le k_i$, $i = 1, \dots, n$;

2. Let $D(x) = (D_1(x), \dots, D_n(x))$;

3. Assign $x$ to the class with id $\delta(x) = S_a(D(x))$.

Let us notice that using one of majority rules (1,2) we can generalize the IRP 
method to sequences of measurement vectors. 

The above algorithm is of generic type. A specific algorithm is obtained by 
specifying: 

1. a method for constructing the IRP ensemble Zi, i = 1, . . . , n; 

2. a norm \\x — y\\; 

3. a selection technique $S_a$;

4. majority rule (for sequence recognition only). 
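
The generic procedure can be sketched as follows. This is a minimal illustration, assuming the Euclidean norm, constant eigenoperators and the minimum distortion technique as the three choices; all names (`irp_classify`, `constant_operator`, the sample reference points) are hypothetical.

```python
import math

def euclid(x, y):
    """Euclidean norm ||x - y||, one possible choice of norm."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def constant_operator(ref):
    """Constant eigenoperator: maps every x to the reference point."""
    return lambda x: ref

def select_min(D):
    """Minimum distortion technique S_1 (class ids start at 1)."""
    return min(range(len(D)), key=lambda i: min(D[i])) + 1

def irp_classify(x, ensembles, select, norm=euclid):
    """Steps 1-3: compute D_i(x), collect D(x), apply the selection technique."""
    D = [[norm(x, F(x)) for F in operators] for operators in ensembles]
    return select(D)

# Hypothetical reference points for two classes.
refs = [[(0.0, 0.0), (1.0, 0.0)], [(5.0, 5.0)]]
ensembles = [[constant_operator(r) for r in cls] for cls in refs]
print(irp_classify((0.9, 0.2), ensembles, select_min))
```

Swapping in another ensemble construction, norm or selection function changes the classifier without touching `irp_classify`, which is the sense in which the algorithm is generic.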

In Figure 1 we give a graphical intuition for an IRP classifier. Concentric
circles stand for $N$-dimensional spheres. The left family of concentric spheres
illustrates the eigenoperator of the first class. The larger sphere is mapped onto
a smaller sphere by the eigenoperator. Now if $x$ belongs to the smaller sphere
centered at the reference point of the first class and at the same time
belongs to the larger sphere centered at the reference point of the second
class, then, as shown in the picture, the distance $\|x - F^{(1)}(x)\|$ is smaller
than $\|x - F^{(2)}(x)\|$. Therefore $x$ is assigned to the first class.




Fig. 1. Illustration of the IRP method. 
7 Specifying some known classifiers by the Invariant
Reference Points method

In this section, we show that several known classifiers which are based on distance 
from a pattern can be specified as special cases of the IRP method. 

1. The minimum distance from reference points method [8,9,3]: Let
us assume that in the $i$-th class there are $k_i$ reference points
$\mathcal{R}_i = \{x_1^{(i)}, \dots, x_{k_i}^{(i)}\}$. Building an IRP ensemble we choose the
constant operator technique, i.e. we choose $F_j^{(i)}(x) = x_j^{(i)}$ for each $x \in R^N$.
Let us notice that the distortion introduced by the operator $F_j^{(i)}$ equals
the distance of $x$ to the reference point $x_j^{(i)}$:

$$\|x - F_j^{(i)}(x)\| = \|x - x_j^{(i)}\| \qquad (4)$$

Therefore, by choosing the selection technique $S_1$, which is based on the minimum
of the distortions, we get a classifier which chooses the class to which
the closest reference point belongs:

$$\delta(x) = S_1(D(x)) = \arg\min_{1 \le i \le n} \min_{1 \le j \le k_i} \|x - x_j^{(i)}\|;$$

2. k nearest neighbors method (k-NN) [3]:

As in the minimum distance method, we take constant operators for the
IRP ensemble. As the selection technique we choose the technique $S_4$ of
$k$ least distortions. Then the sequence $I_k(x) = (i_1, \dots, i_k)$ is the sequence
of class ids of the $k$ least distances to reference points. The technique $S_4$ chooses the most
frequent class in the sequence $I_k(x)$, i.e. the class which is the most frequent
class among the $k$ nearest neighbors of the testing vector $x$:

$$\delta(x) = S_4(D(x)) = \arg\max_{j} |\{a : i_a = j\}|.$$

This is the decision function of the very popular k-NN classifier, which has proved
to be very effective in many applications (e.g. crop estimates based on
LANDSAT satellite pictures);

3. Radial basis functions classifier [5,1,11]: Suppose that we have only a
single reference point $x_1^{(i)}$ for the $i$-th class. Choosing the constant
eigenoperators $F_1^{(i)}$ and the selection technique $S_3$, we get the following decision
function:

n 

6{x) = Ss{D{x)) = + . 

i=l 

Denoting $w_i = c_{i1}$ and $x^{(i)} = x_1^{(i)}$, let us consider the function

n 

i=l 
The function $g(x)$ is a linear combination of Gaussian basis functions. There
are well known techniques for finding the basis function parameters and the weights $w_i$ such that $g(x)$
estimates a given function $f$ which is known only through learning
pairs $(y, i)$, i.e. measurement vectors $y$ of learning objects and the ids $i$
($i = 1, \dots, n$) of the classes the objects belong to. The obtained classifier is the known
radial basis functions neural network.

8 Conclusion 

A new methodology for pattern recognition was elaborated. It is based on the design of
invariant reference points. The methodology is capable of defining new classifiers,
such as the fractal operator classifying system. It is shown that many prominent
classical recognition schemes, for instance the k-NN distance classifier, are special
cases of the IRP method.
Abstract

In image compression using wavelet transforms, the final stage of processing
often involves entropy encoding, of which arithmetic coding is the most
essential form. A significant contributor to the effectiveness of arithmetic encoding
is the selection of coding contexts. We show for various context selection
schemes that the interbit correlations in the multi-symbol alphabet are a primary
source of compression gain in the entropy coding of the image. Further,
we analyze the use of more conventional context selection schemes and show
that full image histograms contain information not yet available to the decoder
in embedded algorithms. The use of predictors in the embedded algorithm can
be quite ineffective.



1 Introduction 

Methods of image compression involving the use of orthonormal wavelet transforms
have proven to be effective [1-3]. Shapiro [4] noticed that, since wavelet
transforms retain pixel spatial positions, an insignificant coefficient in the lower
frequency subbands likely entails the existence of insignificant coefficients in the
same spatial locations in the higher frequency subbands. He introduced a method
of encoding entire trees of zeroes with a single symbol, called embedded zerotree
wavelet (EZW) coding. The essence of the EZW method is scalar quantisation
based on the notion of significance:

Given $n$, if $|c_{ij}| \ge 2^n$, a coefficient $c_{ij}$ is significant with respect to the given
threshold $n$; otherwise it is called insignificant. A similar notion of significance
holds for sets (at least one of their elements has to be significant).
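
The significance test is simple enough to state directly in code. The sketch below assumes the non-strict comparison $|c| \ge 2^n$ that is usual for zerotree coders; the function names are illustrative only.

```python
def is_significant(c, n):
    """A coefficient c is significant at threshold level n if |c| >= 2**n."""
    return abs(c) >= 2 ** n

def set_is_significant(coeffs, n):
    """A set is significant if at least one of its elements is."""
    return any(is_significant(c, n) for c in coeffs)

print(is_significant(-8, 3), is_significant(7, 3))
print(set_is_significant([1, -3, 9], 3))
```

Lowering `n` by one from pass to pass halves the threshold, so coefficients become significant in order of decreasing magnitude.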

The basic structure for wavelet transform coding of images is a three-directional
pyramidal subband structure [2, 4] coming from the decomposition.
Pixels of the lowest LL (low-low) subband are roots of trees. The parent-child
relationship and dependency is central to the idea of zerotrees. In an embedded
coder, trees are processed piecewise by scanning coefficients in the transformed
image through subbands in an assumed order. In the Shapiro algorithm each
coefficient was examined independently of any others (no context model). Following
Shapiro, Said and Pearlman (S&P) proposed an algorithm using "spatial-orientation
trees", which was an extension of the idea of zerotrees [5] but with
some significant implementation differences.

In all of the schemes the final stage of processing involves entropy (arithmetic)
coding [6], based on a selection of coding contexts. The theoretical underpinning
of the gain due to the use of a context is the mutual information $I(X,Y)$. This quantity is
non-negative and equals

$$I(X,Y) = H(X) - H(X|Y),$$

where $H(X)$ is the entropy of the distribution $X$ and $H(X|Y)$ is the conditional entropy of
$X$ given knowledge of $Y$, which can be treated as a context.
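
As a rough illustration of this quantity, the following sketch estimates $I(X,Y) = H(X) - H(X|Y)$ from a list of observed (symbol, context) pairs. It is a plain histogram estimate under the assumption that the samples are representative; the helper names are hypothetical.

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy in bits of the empirical distribution given by counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def mutual_information(pairs):
    """Estimate I(X,Y) = H(X) - H(X|Y) from (x, y) samples, y being the context."""
    n = len(pairs)
    h_x = entropy(list(Counter(x for x, _ in pairs).values()))
    by_context = {}
    for x, y in pairs:
        by_context.setdefault(y, Counter())[x] += 1
    # H(X|Y) is the entropy within each context, weighted by context frequency.
    h_x_given_y = sum(
        sum(c.values()) / n * entropy(list(c.values())) for c in by_context.values()
    )
    return h_x - h_x_given_y

dependent = [(0, 0), (1, 1)] * 4              # the context determines the symbol
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]  # the context tells nothing
print(mutual_information(dependent), mutual_information(independent))
```

A context that fully determines the symbol recovers the whole entropy $H(X)$ as gain, while an uninformative context yields no gain at all.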

With the S&P coder (one of the best existing) as a basis, we examine the subject
of context selection for embedded wavelet image codecs. In section two we discuss
the use of a multi-symbol alphabet versus a binary alphabet within the scope of
the Said and Pearlman scheme. In section three we discuss the use of popular
context selection strategies in embedded coders and present sample results from
extensive experimental data.



2 Analysis of Said and Pearlman Context Selection 

The first two stages of the S&P algorithm, the significants selection and the
descendants selection, correspond to what Shapiro called the dominant pass in
his original paper. In the significants selection section, each coefficient is determined
to be significant or insignificant relative to the current threshold level;
in the descendants selection section, the coefficients are determined to have or
not to have significant descendants, thus informing the decoder whether
to proceed down the tree or whether the rest of the tree is a zerotree
relative to the current threshold level. The resulting bitstreams from these two
sections are entropy coded in the third stage in sets of a quartet of coefficients
(a 2x2 block) rather than coefficient by coefficient. A state is stored
for each of the quartets as shown in Figure 1, where $A_x$, $B_x$, $C_x$, $D_x$ represent
state information for the (0, 0), (0, 1), (1, 0) and (1, 1) coefficients of the quartet
respectively, and the subscript $x \in \{s, d\}$ denotes the significant (s) and descendant
significant (d) states for each of the quartet coefficients. For instance, if $A_s$ and
$B_d$ have the value 1, this indicates that the coefficient in the (0, 0) position
of the quartet has already been found to be significant and, furthermore, that the
coefficient in the (0, 1) position has been found to have a significant descendant.
Figure 1: State structure for the coefficient quartet.

2.1 Said and Pearlman Multi-Symbol Alphabet

Each quartet of coefficients is classified into one of a set of equivalence classes and
then assigned a number which acts as the index into the context table used by the
arithmetic coder. Thus, contexts will have a variable number of symbols (i.e., bit
length) based on the degrees of freedom of the equivalence class. There are 34
such equivalence classes for the significance selection section, and 34 equivalence
classes plus 5 special case classes for the descendants section. As an example of
context formation, the context indexed as number 2 has all states with 2
coefficients still not found to be significant while all the others have been found
significant. This means that there are 2 degrees of freedom, implying a 4-symbol
alphabet for this context. In Table 1, we show the entropy breakdown produced
from the bitstream of the significance selection section of the S&P coder. The average
bit compression for the conditional entropy is 0.889.
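
To make the relation between degrees of freedom, alphabet size and compression concrete, the sketch below computes the entropy of the symbols observed in one context and divides it by the bits per symbol. It assumes that the Comp column of Table 1 is this ratio, which agrees with the tabulated values (for instance 3.888 / 4 = 0.972 for context 17); the symbol counts used here are invented for illustration.

```python
import math

def context_entropy(symbol_counts):
    """H(X|S=n) in bits, from the symbol counts observed in one context."""
    total = sum(symbol_counts)
    return -sum(c / total * math.log2(c / total) for c in symbol_counts if c > 0)

def compression_ratio(symbol_counts, degrees_of_freedom):
    """Entropy per raw bit: a context with d degrees of freedom has 2**d symbols."""
    assert len(symbol_counts) == 2 ** degrees_of_freedom
    return context_entropy(symbol_counts) / degrees_of_freedom

# Hypothetical counts for a context with 2 degrees of freedom (4 symbols).
uniform = [5, 5, 5, 5]
skewed = [12, 4, 3, 1]
print(compression_ratio(uniform, 2), compression_ratio(skewed, 2))
```

A uniform context gains nothing (ratio 1), while a skewed symbol distribution pushes the ratio below 1; this is the gain attributed to interbit correlations.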



Model     Total    H(X|S)   Comp     p(X=n)   bits per
(S = n)                                       symbol

0          2463    1.000    1.000    0.054    1
1          2586    1.983    0.991    0.057    2
2          3112    2.839    0.946    0.068    3
3         11440    3.141    0.785    0.250    4
4             0    -        -        0.000    NA
5           466    1.000    1.000    0.010    1
6           200    1.997    0.998    0.004    2
7            48    2.765    0.922    0.001    3
8             0    -        -        0.000    NA
9           234    1.000    1.000    0.005    1
10           62    1.987    0.993    0.001    2
11            0    -        -        0.000    NA
12          100    0.977    0.977    0.002    1
13            0    -        -        0.000    NA
14          331    0.997    0.997    0.007    1
15          413    1.981    0.991    0.009    2
16          211    2.929    0.976    0.005    3
17           94    3.888    0.972    0.002    4
18          525    1.000    1.000    0.011    1
19          450    1.991    0.996    0.010    2
20          150    2.961    0.987    0.003    3
21          664    1.000    1.000    0.015    1
22          300    1.982    0.991    0.007    2
23          809    0.996    0.996    0.018    1
24          378    1.969    0.984    0.008    2
25          337    2.915    0.972    0.007    3
26          156    3.852    0.963    0.003    4
27          775    1.945    0.972    0.017    2
28          454    2.921    0.974    0.010    3
29         2133    1.938    0.969    0.047    2
30          507    2.831    0.944    0.011    3
31          665    3.605    0.901    0.015    4
32         3952    2.760    0.920    0.087    3
33        11663    3.358    0.839    0.255    4
total     45678

Table 1: mth order conditional entropy per context for Lenna at 1.0 bpp rate. 



Two rows (the most frequent contexts, which contribute the most to the compression
ratio) are highlighted in the table. These contexts are: context
3 - all nodes with significant descendants found, but no significant members in
the quartet; and context 33 - nothing significant found in the quartet, neither
descendants nor quartet members. In the second case there are 4 degrees of freedom,
implying 16 symbols for this context.



2.2 Binary Alphabet Performance 



The multi-symbol alphabet employs the mth order entropy of the symbol. A
binary alphabet would employ a zeroth order entropy and is more efficiently
handled by arithmetic coders. In Table 2, we show the results of recomputing
the entropy of the bitstream in the zeroth order.
(S = n)   Input symbol     Total   H(X|S)   p(X=n)
          0        1

...
9         ...      ...       234   1.000    0.002
10        58       66        124   0.997    0.001
11        0        0           0   0.000    0.000
12        41       59        100   0.977    0.001
...
15        ...      ...       ...   0.995    0.006
16        ...      ...       ...   0.984    0.004
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Input symbol 








(S = n) 


0 


1 


Total 


H(X|S) 


p(X=n) 


17 


218 


158 


376 


0.982 


0.003 


18 


261 


264 


525 


1.000 


0.004 


19 


463 


437 


900 


0.999 


0.006 


20 


251 


199 


450 


0.990 


0.003 


21 


338 


326 


664 


1.000 


0.005 


22 


331 


269 


600 


0.992 


0.004 


23 


373 


436 


809 


0.996 


0.006 


24 


422 


334 


756 


0.990 


0.005 


25 


590 


421 


1011 


0.980 


0.007 


26 


370 


254 


624 


0.975 


0.004 


27 


925 


625 


1550 


0.973 


0.011 


28 


787 


575 


1362 


0.982 


0.010 


29 


2516 


1750 


4266 


0.977 


0.030 


30 


967 


554 


1521 


0.946 


0.011 


31 


1785 


875 


2660 


0.914 


0.019 


32 


7870 


3986 


11856 


0.921 


0.083 


33 


33650 


13002 


46652 


0.854 


0.327 



Table 2: Zeroth entropy from contexts using binary alphabet at 1.0 bpp. 



For the descendants selection section in the tree, most of the gain originates
from the use of "escape" routines which involve special contexts; these encode
the predominant zero condition in the bitstream. For lack of space we do not
give results. The interbit correlations are higher in descendant selection than
in significance selection: the mth order entropy is much lower than the binary
entropy and significantly lower than for the significance selection portion of the
algorithm.
3 Conventional context selection schemes

Conventional context selection schemes typically involve the use of linear predictors
which are formed as a weighted linear combination of context
variables: neighboring coefficients and/or the parent coefficient. In contrast, the
Said and Pearlman context does not use any additional information external to
the quartet. In particular, it does not employ the parent-child correlation. In order
to study context formation, we use an extended version of the technique used
by Buccigrossi and Simoncelli [7]. They showed the correlation between magnitudes
of predictors and the coefficients to be predicted by constructing a joint
histogram of the coefficients. In addition, they computed the mutual information
as a measure of the coding gain obtained from knowledge of the conditional information.
In this paper, we compute the mutual information from the full image
histograms. Then we compute the equivalent values from data obtained from an
embedded implementation of the image coder. As a simplified case we have verified
that the histogram results (not shown here) for the full Lenna image and a
single predictor (the parent coefficient) reveal a strong magnitude correlation.

3.1 Full Image Joint Statistics 

The context variables used in this study are the parent, the upper neighbor,
the left neighbor and the cousin (i.e., the coefficient in the same spatial position as
the coefficient to be predicted but located in a different subband), as shown in
Figure 2.




Figure 2: Neighboring coefficients to be used in linear predictor construction. 



In Table 3, on the left we show the mutual information computed from such
histograms for each of the four predictor coefficients for each of the decomposition
directions. In order to use the full predictor, we compute the optimal weights to
form the weighted linear combination of the predictor variables (the neighboring
node values) by minimizing the MSE of the predictor:

$$w = E(P P^T)^{-1} E(C\,P),$$

where $w$ is the vector of parameter weights, $P$ is the set of parameter magnitudes,
$C$ is the coefficient magnitude being predicted, and $E(\cdot)$ is the expected value.
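
A small sketch of this weight computation follows. It replaces the expectations by sample averages and solves the resulting normal equations with a hand-written Gaussian elimination, so that it needs no external library; the sample data and function names are assumptions for illustration, not the procedure used in the experiments.

```python
def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("autocorrelation matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def predictor_weights(P_samples, C_samples):
    """w = E(P P^T)^{-1} E(C P), with expectations taken as sample means."""
    m, k = len(P_samples), len(P_samples[0])
    autocorr = [[sum(p[i] * p[j] for p in P_samples) / m for j in range(k)]
                for i in range(k)]
    crosscorr = [sum(c * p[i] for p, c in zip(P_samples, C_samples)) / m
                 for i in range(k)]
    return solve(autocorr, crosscorr)

# Hypothetical magnitudes of two predictor variables and the predicted coefficient.
P = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
C = [0.5 * p0 + 0.25 * p1 for p0, p1 in P]
print(predictor_weights(P, C))
```

The explicit singularity check matters in practice: when too few samples are available, the autocorrelation matrix may not be invertible and no weights can be obtained.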

Computing the mutual information of the full predictor, we get the values
on the right side of Table 3. We see that the compression percentages are very
competitive with those of Said and Pearlman. However, it is important to note
that the gains that can be expected from the individual predictor variables, as
indicated by the individual mutual information results, do not add up in
the full predictor, as can be seen when comparing the %comp
values of both parts of Table 3. This, we believe, is caused by some correlation
between the predictor variables themselves, so that there is an overlap in the
conditional knowledge provided by each predictor variable.
Predictor   Direction    H(X)    H(X|Y)   I(X,Y)   %comp

...         vertical     ...     ...      ...      8.251
Left        horizontal   ...     ...      0.225    9.554
            diagonal     1.997   ...      0.15     ...
...         vertical     ...     ...      ...      11.65
...         ...          1.997   ...      ...      ...
...         ...          ...     1.973    0.16     ...
Cousin      horizontal   2.355   2.218    0.136    5.775
            diagonal     1.997   1.861    0.136    6.81
            vertical     2.133   1.987    0.146    6.845

Table 3: Mutual information of coefficient histograms for Lenna. 



3.2 Conventional Context in Embedded Algorithms 

The problem with the mutual information values in the table above is that they
assume full knowledge of the image statistics. However, as previously mentioned,
this is not true of embedded coders in the style of Said and Pearlman, and
significant side information would be required.

We apply the procedure above to the statistics available to the decoder as
it actually executes in the Said and Pearlman code in order to compare the
possible gains from using a different type of context. The data presented in Table
4 applies only to the significance selection portion of the codec. The predictor
index indicates which of the predictors are available for use at the time the current
coefficient is being evaluated (i.e., its value predicted); the zero case is not included
in the table. The predictor weights were computed for the data per pass of
the algorithm. Table 4 contains entropies of the horizontal direction coefficients
(results for the diagonal and vertical directions are similar). The overall entropies
are in general worse than those of the simple Said and Pearlman contexts. More
importantly, the results provide much less compression gain than would be expected
from the mutual information computed with
knowledge of the full image histogram.
Table 4: Conditional entropy of the coefficient given the predictor (horizontal
direction).



Several factors contribute to the worse than expected results. First, the context
is binary; we have seen that for both the significance selection and descendants
determination portions of the embedded codec, the interbit correlations
are considerable, and in this case they are not capitalized upon. Second, there is a significant
decrease in performance from not having all the predictor variables available
at the time of evaluation of each coefficient. Third, there is a problem in computing
the weights on a per pass basis, which would also affect any other less
detailed version: the inverse of the autocorrelation function is
not always possible to obtain. We then have to average over a possibly large number
of passes' worth of bits. This averaging results in prediction errors which
increase with the number of passes over which the averaging has to take
place.
4 Summary

We have evaluated various context selection schemes. We have shown that the 
interbit correlations are the primary source of compression gain in the entropy 
coding of the image. Further, we have studied the use of more conventional 
context selection schemes and shown that, in this case, the gain expected from a 
study of the full image histograms can be deceiving and that one has to analyze 
the predictor in the light of the information currently available to the decoder; 
under these constraints using predictors in the embedded algorithm can be quite 
ineffective. We have found that all patterns of results applied not only to the 
Lenna image case, but held over all the images we have tested; among these were 
the commonly used images of: baboon, boats, goldhill, peppers, and cheyenne, 
and over a broad range of image types and at a wide range of bit rates. 

Recently, Marpe and Cycon [8] improved somewhat upon the Said and Pearlman
code. They use run-length coding prior to, or instead of, arithmetic coding, with
no bitrate gain but a 30% complexity reduction. This is probably where most of
the (small) gain comes from in the non-context/predicted version. They, however,
obtained up to 0.2 dB gain using Pearlman type maps but then working in a
non-embedded fashion, which allows them to use more information for context
formation.

C. Jędrzejek acknowledges the partial support by the Polish Scientific Committee
(KBN) grant 8T11E035 10, and ECU grant CRIT2.
Abstract. Markov networks utilize nonembedded probabilistic conditional
independencies in order to provide an economical representation
of a joint distribution in uncertainty management. In this paper we study 
several properties of nonembedded conditional independencies and show 
that they are in fact equivalent. The results presented here not only show 
the useful characteristics of an important subclass of probabilistic condi- 
tional independencies, but further demonstrate the relationship between 
relational theory and probabilistic reasoning. 



1 Introduction 

Belief networks [5,6] utilize probabilistic conditional independencies to provide 
an economical representation of a joint distribution for managing uncertainty. A 
Bayesian network is a directed acyclic graph, explicitly specifying the embedded 
and nonembedded independency information, coupled with a corresponding set 
of conditional probability distributions. As performing inference in such a net- 
work may easily become intractable [2,3], it is useful to transform a Bayesian 
network into a Markov network [5] albeit sacrificing the embedded indepen- 
dencies. That is, a Markov network is defined with respect to a hypergraph 
which explicitly specifies nonembedded conditional independencies only. This 
subclass of nonembedded independencies, called generalized multivalued depen- 
dencies (GMVDs) [13] , is defined in a similar fashion to multivalued dependencies 
in relational database theory. 

In this paper we study several properties of GMVDs and show that they are 
in fact equivalent. The results presented here not only show the useful charac- 
teristics of this important subclass of conditional independencies, but further 
demonstrate the intriguing relationship between relational database theory and 
probabilistic reasoning. It is perhaps worth mentioning that probabilistic condi- 
tional independencies do not have a complete axiomatization [8,9] contrary to 
Pearl’s conjecture [6]. However, it has been shown that GMVDs do in fact have 
a finite complete axiomatization [12]. 

Although the discussion in this paper draws heavily from [1], the exposition 
presented here is more general. For instance, in relational databases multivalued
dependency [4] is a necessary and sufficient condition for a relation to be losslessly
decomposed into two projections. However, multivalued dependency is a
necessary but not a sufficient condition for defining the notion of probabilistic
conditional independence [13].
This paper is organized as follows. Section 2 contains background knowledge
of our extended relational model [10-13] for probabilistic reasoning. In Section
3, we analyze several properties of GMVDs and show that they are in fact
equivalent.

2 Background 

2.1 Hypergraphs and Hypertrees 

A hypergraph [1,7] is a pair $(\mathcal{N}, \mathcal{R})$, where $\mathcal{N}$ is a finite set of nodes and
$\mathcal{R} = \{R_1, R_2, \dots, R_n\}$ is a finite set of hyperedges which are arbitrary subsets of $\mathcal{N}$.
An ordinary undirected graph without self loops is a hypergraph whose every
hyperedge consists of two nodes. In this paper we assume $\mathcal{N} = R_1 \cup R_2 \cup \dots \cup R_n$ and
henceforth will often refer to the hypergraph $\mathcal{R}$ without mentioning the set $\mathcal{N}$
of nodes.

We call an element $R_i \in \mathcal{R}$ a twig if there exists another distinct element
$R_j \in \mathcal{R}$ such that $R_i \cap (\bigcup(\mathcal{R} - \{R_i\})) = R_i \cap R_j$. (By this definition, the hyperedge
in a hypergraph consisting of a single hyperedge is not a twig.) This means
that the intersection of $R_i$ and the rest of the hypergraph is contained in one hyperedge of
the hypergraph. We call any such $R_j$ a branch for the twig $R_i$, and note that a
twig $R_i$ may have many possible branches. A hypergraph is called a hypertree
(an acyclic hypergraph [1,7]) if its elements can be ordered, $R_1, R_2, \dots, R_n$, such
that $R_i$ is a twig in the sub-hypergraph $R_1, R_2, \dots, R_i$ for $i = 2, \dots, n$. We call
any ordering satisfying this condition a hypertree construction ordering for $\mathcal{R}$.
(A hypertree construction ordering can also be represented as a join tree [1].)
The first hyperedge $R_1$ in the hypertree construction ordering is called the root.
Given a particular hypertree construction ordering, we can choose an integer
$b(i)$, for $i = 2, \dots, n$, such that $1 \le b(i) \le i - 1$ and $R_{b(i)}$ is a branch for $R_i$ in
$R_1, R_2, \dots, R_i$. We call a function $b(i)$ satisfying this condition a branching
function for $\mathcal{R}$. Note that a particular construction ordering may have many
branching functions.

For example, the hypergraph $\mathcal{R} = \{R_1 = \{A_1, A_2, A_3\}, R_2 = \{A_1, A_2, A_4\},
R_3 = \{A_1, A_2, A_5\}, R_4 = \{A_5, A_6\}\}$, depicted in Figure 1, is in fact a hypertree;
for instance, the ordering $R_1, R_2, R_3, R_4$ is a hypertree construction ordering and
$b(2) = 1$, $b(3) = 1$, $b(4) = 3$ defines one possible branching function.
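
The twig condition can be checked mechanically. The sketch below tests whether a given ordering of hyperedges is a hypertree construction ordering and lists the admissible branches of each hyperedge; it represents hyperedges as Python sets, and the function names are illustrative only. Note that it verifies one given ordering and does not search for an ordering.

```python
def branches(ordering, i):
    """1-based indices j < i such that R_j is a branch for R_i in R_1, ..., R_i."""
    R_i = ordering[i - 1]
    earlier = ordering[: i - 1]
    # Intersection of R_i with the union of all other hyperedges seen so far.
    shared = R_i & set().union(*earlier)
    return [j + 1 for j, R_j in enumerate(earlier) if shared == R_i & R_j]

def is_construction_ordering(ordering):
    """True if every R_i (i >= 2) is a twig in the sub-hypergraph R_1, ..., R_i."""
    return all(branches(ordering, i) for i in range(2, len(ordering) + 1))

R1, R2 = {"A1", "A2", "A3"}, {"A1", "A2", "A4"}
R3, R4 = {"A1", "A2", "A5"}, {"A5", "A6"}
example = [R1, R2, R3, R4]
triangle = [{"A", "B"}, {"B", "C"}, {"A", "C"}]  # an ordinary cycle

print(is_construction_ordering(example), is_construction_ordering(triangle))
print([branches(example, i) for i in (2, 3, 4)])
```

For the example hypergraph the hyperedge $R_3$ admits both $R_1$ and $R_2$ as branches, which illustrates why one construction ordering may have several branching functions.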

A path from a node $A_i$ to a node $A_j$ is a sequence of hyperedges $R_1, R_2, \dots, R_k$
($k \ge 1$) such that $A_i \in R_1$, $A_j \in R_k$, and $R_l \cap R_{l+1} \ne \emptyset$ if $1 \le l < k$. We also
say that the above sequence of hyperedges is a hyperedge path (or simply path
when no confusion arises) from $R_1$ to $R_k$.

Two nodes (attributes) are connected if there is a path from one to the other.
Similarly, two hyperedges are connected if there is a hyperedge path from one to
the other. A set of nodes or hyperedges is connected if every pair is connected.
The connected components are the maximal connected sets of nodes or hyperedges.

The following undirected graph terminology will also be useful in describing
characteristics of GMVDs. A clique in a graph is a set of nodes such that every
pair forms an edge of the graph. A cycle in a graph is a sequence $(A_1, A_2, \dots, A_m)$
of nodes ($m > 3$) such that each $A_i$ is distinct, except that $A_1 = A_m$, and
$(A_i, A_{i+1})$ is an edge for $1 \le i < m$.

Let $\mathcal{R}$ be a hypergraph. The graph of $\mathcal{R}$, denoted $G(\mathcal{R})$, has the same nodes
as $\mathcal{R}$ and an edge between every pair of nodes that are in the same hyperedge of
$\mathcal{R}$. Thus, the edges of $G(\mathcal{R})$ are precisely the set of all pairs $(A_i, A_j)$ for which
there is a hyperedge $R_k \in \mathcal{R}$ with $A_i, A_j \in R_k$.

Fig. 1. A graphical representation of the hypergraph $\mathcal{R} = \{R_1, R_2, R_3, R_4\}$.

2.2 Extended Relational Definitions 

Let $\mathcal{N}$ be a finite set of distinct variables and let $X \subseteq \mathcal{N}$. Following [1], we define
an $X$-tuple (or simply a tuple if $X$ is understood) to be a function with domain
$X$. Thus a tuple is a mapping that associates a value with each attribute in $X$. If
$Y \subseteq X$ and $t$ is an $X$-tuple, then $t[Y]$ denotes the $Y$-tuple obtained by restricting
the mapping to $Y$. An $X$-relation (or a relation over $X$, or more simply a relation
if $X$ is understood) is a finite set of $X$-tuples. If $r$ is an $X$-relation and $Y \subseteq X$,
then the projection of $r$ onto $Y$, denoted $r[Y]$, is the set of all tuples $t[Y]$ where
$t$ is in $r$.
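
These definitions map naturally onto dictionaries. The following sketch, with hypothetical names, models an $X$-tuple as a dictionary from attributes to values, the restriction $t[Y]$ as a sub-dictionary, and the projection $r[Y]$ as the duplicate-free list of restricted tuples.

```python
def restrict(t, Y):
    """t[Y]: restrict the mapping t to the attributes in Y."""
    return {a: t[a] for a in Y}

def project(r, Y):
    """r[Y]: the set of all tuples t[Y] with t in r (duplicates removed)."""
    result = []
    for t in r:
        s = restrict(t, Y)
        if s not in result:
            result.append(s)
    return result

# A small relation over X = {A1, A2}.
r = [{"A1": 0, "A2": 0}, {"A1": 0, "A2": 1}, {"A1": 1, "A2": 1}]
print(project(r, ["A1"]))
```

Projection may shrink the relation, since distinct tuples can agree on the retained attributes.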

We now extend the traditional relational concepts in order to express corresponding
probabilistic concepts used in managing uncertain knowledge. Let
$r$ be a fixed relation representing the domain of a finite set of variables $\mathcal{N}$. A
joint probability distribution [5,6] over $r$ is a function $\phi$ on $r$ assigning to each
tuple $t \in r$ a real number $0 \le \phi(t) \le 1$ such that $\sum_{t \in r} \phi(t) = 1$. (We say the
distribution is over $\mathcal{N}$ when the domain $r$ is understood, and write $\phi$ as $\phi_{\mathcal{N}}$.)
Suppose $\phi_{\mathcal{N}}$ is a joint probability distribution over $\mathcal{N} = \{A_1, A_2, \dots, A_l\}$. We
can succinctly express the probability distribution $\phi_{\mathcal{N}}$ as an extended relation
$\Phi_{\mathcal{N}}$ with attributes $\{A_1, A_2, \dots, A_l, f_{\phi_{\mathcal{N}}}\}$, as shown in Figure 2. Each row in $\Phi_{\mathcal{N}}$
corresponds to a tuple $t_i \in r$, and $s$ is the cardinality of $r$.

Let $\phi_X$ and $\phi_Y$ be two distributions over $X$ and $Y$ respectively. We can
express the product of these distributions $\phi_X \cdot \phi_Y$ as the product join of the
extended relations $\Phi_X$ and $\Phi_Y$, denoted by $\Phi_X \times \Phi_Y$, as follows:

$\Phi_X \times \Phi_Y$ is the extended relation obtained by adding a new column with
attribute $f_{\phi_X \cdot \phi_Y}$ to the relation $r = \Phi_X[X] \bowtie \Phi_Y[Y]$, where $\bowtie$ is the natural
join operator [1]. ($r$ is referred to as the domain of $\Phi_X \times \Phi_Y$.) For each tuple
$t \in \Phi_X \times \Phi_Y$, $t[XY] \in r$ and $t[f_{\phi_X \cdot \phi_Y}] = \phi_X(t[X]) \cdot \phi_Y(t[Y])$.
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Ai A2 ... Ai 

ti[Ai] ti[A2] .. . ti[Ai] 

t2[Al\ t2[A2] ■ ■ ■ t2[Al] t2[f4,j^\ = (j>u{t2[^P\) 



ts[^i] ts[A2] .. . ts[Ai] ts[f4,j^] = </>Ar(C[A/l) 



Fig. 2. A joint distribution expressed as an extended relation j\f . 



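If $\Phi_{\mathcal{N}}$ is an extended relation and $X \subseteq \mathcal{N}$, then the marginalization of $\Phi_{\mathcal{N}}$
onto $X$ is the marginal distribution, denoted $\Phi_{\mathcal{N}}^{\downarrow X}$, with attributes $X \cup \{f_{\phi_X}\}$,
defined by:

$\Phi_{\mathcal{N}}^{\downarrow X} = \{\, t[X f_{\phi_X}] \mid t[X] \in \Phi_{\mathcal{N}}[X], \text{ and } t[f_{\phi_X}] = \sum_{t'} t'[f_{\phi_{\mathcal{N}}}] \,\}$,

where the sum ranges over all tuples $t' \in \Phi_{\mathcal{N}}$ with $t'[X] = t[X]$.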



Due to limited space, we refer the reader to [10-13] for examples illustrating
the product join $\times$ and marginalization $\downarrow$ operators.
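As a small illustration of these two operators, the following sketch stores an extended relation as a tuple of attribute names plus a dictionary from value tuples to the probability column. The representation, the function names and the sample distributions are assumptions made for this example only.

```python
from fractions import Fraction as F


def marginalize(attrs, phi, X):
    """Marginalization onto X: sum the probability column over tuples agreeing on X."""
    idx = [attrs.index(a) for a in X]
    out = {}
    for t, p in phi.items():
        key = tuple(t[i] for i in idx)
        out[key] = out.get(key, 0) + p
    return tuple(X), out


def product_join(attrs_x, phi_x, attrs_y, phi_y):
    """Product join: natural join of the two domains, new column = product of values."""
    attrs = tuple(attrs_x) + tuple(a for a in attrs_y if a not in attrs_x)
    common = [a for a in attrs_y if a in attrs_x]
    out = {}
    for tx, px in phi_x.items():
        for ty, py in phi_y.items():
            if all(tx[attrs_x.index(a)] == ty[attrs_y.index(a)] for a in common):
                extra = tuple(ty[attrs_y.index(a)] for a in attrs_y if a not in attrs_x)
                out[tx + extra] = px * py
    return attrs, out


# Hypothetical joint distribution over (A1, A2).
phi = {(0, 0): F(1, 2), (0, 1): F(1, 4), (1, 1): F(1, 4)}
marg_attrs, marg = marginalize(("A1", "A2"), phi, ["A1"])

# Product join of two hypothetical distributions without common attributes.
prod_attrs, prod = product_join(
    ("A1",), {(0,): F(1, 2), (1,): F(1, 2)},
    ("A2",), {(0,): F(1, 3), (1,): F(2, 3)},
)
```

Exact fractions are used so that sums and products can be compared without rounding issues.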

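Let $\Phi_{\mathcal{N}}$ be an extended relation and $X \subseteq \mathcal{N}$. The inverse extended relation
$(\Phi_{\mathcal{N}}^{\downarrow X})^{-1}$ is defined from $\Phi_{\mathcal{N}}^{\downarrow X}$ by renaming attribute $f_{\phi_X}$ as $f_{\phi_X^{-1}}$, and

$t[f_{\phi_X^{-1}}] = 1 / t[f_{\phi_X}]$ if $t[f_{\phi_X}] > 0$,
$t[f_{\phi_X^{-1}}] = 0$ otherwise.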



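If $\Phi$ is an extended relation over $\mathcal{N}$ and $X, Y \subseteq \mathcal{N}$, then we say that the
generalized multivalued dependency (GMVD) [12], denoted $X \twoheadrightarrow Y$, holds for
$\Phi$ if

$\Phi = \Phi^{\downarrow XY} \otimes \Phi^{\downarrow XZ} = \Phi^{\downarrow XY} \times \Phi^{\downarrow XZ} \times (\Phi^{\downarrow X})^{-1}$,

where $Z = \mathcal{N} - XY$, and $\otimes$ is called the generalized join operator.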
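The definition can be checked mechanically on small distributions. The sketch below assumes that the generalized join multiplies the two marginals and the inverse of the marginal on $X$, as in the definition above; the function names and both sample distributions are hypothetical.

```python
from fractions import Fraction as F


def marginal(attrs, phi, X):
    """Marginal of phi onto the attribute list X, keyed by value tuples in X order."""
    idx = [attrs.index(a) for a in X]
    out = {}
    for t, p in phi.items():
        k = tuple(t[i] for i in idx)
        out[k] = out.get(k, F(0)) + p
    return out


def gmvd_holds(attrs, phi, X, Y):
    """Test whether Phi equals the generalized join of its marginals on XY and XZ."""
    phi = {t: p for t, p in phi.items() if p > 0}   # keep the support only
    Z = [a for a in attrs if a not in X and a not in Y]
    XY, XZ = list(X) + list(Y), list(X) + Z
    m_xy, m_xz, m_x = (marginal(attrs, phi, S) for S in (XY, XZ, list(X)))
    rebuilt = {}
    for txy, pxy in m_xy.items():
        for txz, pxz in m_xz.items():
            x = txy[:len(X)]
            if x == txz[:len(X)]:                   # natural join on X
                val = dict(zip(XY, txy))
                val.update(zip(XZ, txz))
                t = tuple(val[a] for a in attrs)
                rebuilt[t] = pxy * pxz / m_x[x]     # product with the inverse on X
    return rebuilt == phi


attrs = ["A1", "A2", "A3"]
# A2 and A3 are independent given A1 in this hypothetical distribution.
good = {(0, 0, 0): F(1, 8), (0, 0, 1): F(1, 8), (0, 1, 0): F(1, 8),
        (0, 1, 1): F(1, 8), (1, 1, 0): F(1, 8), (1, 1, 1): F(3, 8)}
# Here A2 and A3 are perfectly correlated, so the dependency fails.
bad = {(0, 0, 0): F(1, 2), (0, 1, 1): F(1, 2)}

holds_good = gmvd_holds(attrs, good, ["A1"], ["A2"])
holds_bad = gmvd_holds(attrs, bad, ["A1"], ["A2"])
```

Comparing the rebuilt relation with the original as dictionaries checks both the domain (the set of tuples) and the probability column at once.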

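For example, consider the joint distribution $\Phi_R$ on $R = \{A_1, A_2, A_3, A_4, A_5\}$
shown in Figure 3. $\Phi_R$ satisfies the GMVD $\{A_1\} \twoheadrightarrow \{A_2, A_3, A_4\}$ since
$\Phi_R = \Phi_R^{\downarrow \{A_1, A_2, A_3, A_4\}} \otimes \Phi_R^{\downarrow \{A_1, A_5\}}$, as shown in Figure 4.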

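We say that an extended relation $\Phi_R$ over attributes $\mathcal{N} = R_1 \cup R_2 \cup \ldots \cup R_n$
obeys the generalized acyclic join dependency (GAJD), denoted $\otimes\{R_1, \ldots, R_n\}$
or $\otimes R$, if $\Phi_R$ can be expressed as

$\Phi_R = (\ldots((\Phi_R^{\downarrow R_1} \otimes \Phi_R^{\downarrow R_2}) \otimes \Phi_R^{\downarrow R_3}) \ldots \otimes \Phi_R^{\downarrow R_n})$,

where $R_1, \ldots, R_n$ is a hypertree construction ordering for $R = \{R_1, R_2, \ldots, R_n\}$.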

Let $\Sigma$ be a set of dependencies and $\sigma$ be a single dependency. We say that
$\Sigma$ logically implies $\sigma$ if whenever every dependency in $\Sigma$ holds for an extended
relation $\Phi$, then $\sigma$ also holds for $\Phi$. That is, there is no "counterexample extended
relation" $\Phi$ such that every dependency in $\Sigma$ holds for $\Phi$ but $\sigma$ does not hold
for $\Phi$.

Fig. 3. A joint distribution $\phi_R$ on $R = \{A_1, A_2, A_3, A_4, A_5\}$ expressed as the extended
relation $\Phi_R$.

Fig. 4. The distribution $\Phi_R$ in Figure 3 satisfies the GMVD $\{A_1\} \twoheadrightarrow \{A_2, A_3, A_4\}$
since $\Phi_R = \Phi_R^{\downarrow \{A_1, A_2, A_3, A_4\}} \otimes \Phi_R^{\downarrow \{A_1, A_5\}}$.

Since the concept of nonembedded conditional independence is the focus of
our discussion, we conclude this section by distinguishing it from an embedded
conditional independence. Consider the joint probability distribution $\phi_R$ on $R =
\{A_1, A_2, A_3, A_4, A_5\}$ as shown in Figure 3. It can be verified that the conditional
independence of $\{A_2\}$ and $\{A_3\}$ given $\{A_1\}$ does not hold with respect to $\phi_R$.
That is,

$\Phi_R \ne \Phi_R^{\downarrow \{A_1, A_2\}} \otimes \Phi_R^{\downarrow \{A_1, A_3, A_4, A_5\}}$.

However, the conditional independence of $\{A_2\}$ and $\{A_3\}$ given $\{A_1\}$ holds in
the marginal distribution $\Phi_R^{\downarrow \{A_1, A_2, A_3\}}$, namely:

$\Phi_R^{\downarrow \{A_1, A_2, A_3\}} = (\Phi_R^{\downarrow \{A_1, A_2, A_3\}})^{\downarrow \{A_1, A_2\}} \otimes (\Phi_R^{\downarrow \{A_1, A_2, A_3\}})^{\downarrow \{A_1, A_3\}}$.

We call such an independency an embedded conditional independency with
respect to the distribution $\phi_R$.

3 Equivalent Characterizations of GMVDs

In this section we derive several characterizations of the set $M$ of generalized
multivalued dependencies (GMVDs) such that $M$ is the consequence of a given
generalized acyclic join dependency (GAJD). Since it has been shown [12] that
every GMVD $X \twoheadrightarrow Y$ is equivalent to the GMVD $X \twoheadrightarrow Y - X$, we will
only consider those GMVDs with $X \cap Y = \emptyset$ to simplify the notation. Thus $M^+$
will denote the set of all GMVDs $X \twoheadrightarrow Y$, with $X \cap Y = \emptyset$, that are logically
implied by the set $M$.

If $\mathcal{H}$ is a hypergraph, then the set of GMVDs generated by $\mathcal{H}$ is the set of
GMVDs $X \twoheadrightarrow Y$, where $Y$ is the union of some connected components of the
hypergraph $\mathcal{H} - X$ obtained from $\mathcal{H}$ by deleting the set $X$ of nodes. That is,
$\mathcal{H} - X = \{E - X \mid E \text{ is an edge of } \mathcal{H}\} - \{\emptyset\}$. We then say that $X$ separates
off $Y$ from the rest of the nodes. A set $M$ of GMVDs is hypergraph generated if
there is a hypergraph that generates $M$. Similarly, $M$ is graph generated if there
is a graph (treated as a hypergraph) that generates $M$.
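The separation test behind this definition can be sketched as follows. The hypergraph is given as a list of hyperedges (sets of node names); the function names and the sample hypergraph are assumptions for illustration.

```python
def components(hyperedges, X):
    """Connected components of H - X, where H - X = {E - X | E in H} minus the empty set."""
    X = set(X)
    comps = []
    for E in (set(E) - X for E in hyperedges):
        if not E:
            continue
        merged, rest = set(E), []
        for c in comps:              # existing components are pairwise disjoint
            if c & merged:
                merged |= c
            else:
                rest.append(c)
        comps = rest + [merged]
    return comps


def generates(hyperedges, X, Y):
    """True if Y is a union of connected components of H - X (X separates off Y)."""
    Y = set(Y)
    covered = set()
    for c in components(hyperedges, X):
        if c <= Y:
            covered |= c
        elif c & Y:                  # a component only partly inside Y
            return False
    return covered == Y


# Hypothetical hypergraph with two hyperedges sharing the node A1.
H = [{"A1", "A2", "A3"}, {"A1", "A4"}]
```

Deleting A1 leaves the components {A2, A3} and {A4}, so A1 separates off A4 and also {A2, A3}, but not A2 alone.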

A GMVD $X \twoheadrightarrow Y$ ($X \cap Y = \emptyset$) splits two attributes $A_i$ and $A_j$ if one of
them is in $Y$ and the other is in $\mathcal{N} - XY$, where $\mathcal{N}$ is the set of all the attributes.
A set $M$ of GMVDs splits $A_i$ and $A_j$ if some GMVD in $M$ splits them.

Lemma 1. Two attributes $A_i$ and $A_j$ are split by a set $M$ of generalized
multivalued dependencies if and only if they are split by its closure.

Lemma 1 indicates that two logically equivalent sets of GMVDs split exactly 
the same pairs of attributes. 

Given a set $M$ of GMVDs, we can construct a graph $G(M)$ with the attributes
as nodes and an edge $(A_i, A_j)$ between two attributes $A_i$ and $A_j$ if $A_i$ and $A_j$
are not split by $M$. For example, let $\mathcal{N} = \{A_1, A_2, A_3, A_4\}$ and $M = \{A_1 \twoheadrightarrow
A_3, A_3 \twoheadrightarrow A_4\}$. The first GMVD splits $A_2$ and $A_3$, and $A_3$ and $A_4$, while the
second GMVD splits $A_1$ and $A_4$, and $A_2$ and $A_4$. The set of edges in the graph
$G(M)$ is $\{(A_1, A_2), (A_1, A_3)\}$.
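The construction of $G(M)$ in this example can be reproduced with a few lines. This is only a sketch: each GMVD is encoded as a pair of sets (left side, right side), and the function names are illustrative. By Lemma 1 it suffices to test the dependencies in $M$ itself rather than its closure.

```python
from itertools import combinations


def splits(gmvd, attrs, a, b):
    """A GMVD X ->> Y splits a and b if one lies in Y and the other in N - XY."""
    X, Y = gmvd
    rest = set(attrs) - set(X) - set(Y)
    return (a in Y and b in rest) or (b in Y and a in rest)


def graph_of(M, attrs):
    """Edges of G(M): all attribute pairs that no GMVD in M splits."""
    return {(a, b) for a, b in combinations(attrs, 2)
            if not any(splits(g, attrs, a, b) for g in M)}


attrs = ["A1", "A2", "A3", "A4"]
M = [({"A1"}, {"A3"}), ({"A3"}, {"A4"})]   # A1 ->> A3 and A3 ->> A4
edges = graph_of(M, attrs)
```

The resulting edge set is {(A1, A2), (A1, A3)}, as stated in the text; A4 is left isolated.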

Lemma 2. Let $M$ be a set of GMVDs, $G(M)$ its graph, and $N$ the set of
GMVDs generated by $G(M)$. Then $M^+ \subseteq N$.

The converse to Lemma 2 does not necessarily hold. In the last example, the
GMVD $\emptyset \twoheadrightarrow A_4$ is graph generated by $G(M)$ but is not logically implied by
$M$. It will be shown that the converse holds exactly for those sets of GMVDs
that form a cover of the set of GMVDs implied by a given generalized acyclic
join dependency, where $M_1$ is a cover of $M_2$ if $M_1^+ = M_2^+$.

We say that $M$ has the intersection property if whenever the GMVDs $X \twoheadrightarrow Z$
and $Y \twoheadrightarrow Z$ are implied by $M$ (with $Z$ disjoint from both $X$ and $Y$), then
$X \cap Y \twoheadrightarrow Z$ is also implied by $M$.

Let $M$ be a set of GMVDs. Two disjoint sets $X$ and $Y$ are called orthogonal
if the GMVD $\mathcal{N} - XY \twoheadrightarrow X$ (or equivalently, by the complementation rule
for GMVDs [12], $\mathcal{N} - XY \twoheadrightarrow Y$) is implied by $M$. It follows from Lemma 1
and the rules for manipulating GMVDs [12] that two attributes $A_i$ and $A_j$ are
orthogonal (i.e., the singleton sets $\{A_i\}$ and $\{A_j\}$ are orthogonal) if and only
if they are split by $M$. It follows from the rules for manipulating GMVDs [12]
that if $X$ and $Y$ are orthogonal, then for every pair $A_i$ and $A_j$ of attributes
where $A_i \in X$ and $A_j \in Y$, necessarily $A_i$ and $A_j$ are orthogonal. We say that
$M$ has the orthogonal property if the converse also holds. That is, $M$ has the
orthogonal property if every two sets $X$ and $Y$ are orthogonal whenever every
attribute of $X$ is orthogonal to every attribute of $Y$.

Theorem 1. Let $M$ be a set of generalized multivalued dependencies. The
following are equivalent:

(1) $M$ is a cover of the set of GMVDs implied by some GAJD.

(2) $M^+$ is hypergraph generated.

(3) $M^+$ is graph generated.

(4) There is exactly one graph that generates $M^+$.

(5) $M^+$ is the set of GMVDs generated by $G(M)$.

(6) $M^+$ has the intersection property.

(7) $M^+$ has the orthogonal property.



Proof. We will show that (1) and (2), and (2) and (3) are equivalent. We then
show (3) $\Rightarrow$ (6) $\Rightarrow$ (7) $\Rightarrow$ (5) $\Rightarrow$ (3). Hence, conditions (1), (2), (3), (5), (6), (7)
are equivalent. Finally, we show (5) $\Rightarrow$ (4) $\Rightarrow$ (3), which shows (4) is equivalent
to the others.

(1) $\Leftrightarrow$ (2): The set of GMVDs implied by a generalized acyclic join dependency
$\otimes R$ is exactly the set of GMVDs generated by the hypergraph $R$ [12]. Thus, if
$M$ is the set of GMVDs implied by the generalized acyclic join dependency $\otimes R$,
then we know that $M^+$ is the set of GMVDs generated by the hypergraph $R$.

(2) $\Rightarrow$ (3): Let $\mathcal{H}$ be a hypergraph, and let $G = G(\mathcal{H})$ be the graph of $\mathcal{H}$. It is
easy to see that a set $X$ of nodes separates off another set $Y$ in $\mathcal{H}$ if and only
if $X$ separates off $Y$ in $G$. Therefore, the set of GMVDs generated by $\mathcal{H}$ is the
same set of GMVDs generated by $G$.

(3) $\Rightarrow$ (2): Obvious, since every graph is a hypergraph.

(3) $\Rightarrow$ (6): Let $\mathcal{H}$ be the hypergraph that generates $M^+$. Suppose $X \twoheadrightarrow Z$
and $Y \twoheadrightarrow Z$ are in $M^+$. Since $X \twoheadrightarrow Z$ is in $M^+$, $X$ separates off $Z$
from $\mathcal{N} - XZ$. Thus, no node in $Z$ is connected to a node in $\mathcal{N} - XZ$. Since
$Y \twoheadrightarrow Z$ is in $M^+$, $Y$ separates off $Z$ from $\mathcal{N} - YZ$. Thus, no node in $Z$ is
connected to a node in $\mathcal{N} - YZ$. Therefore, no node in $Z$ is connected to a node
in $\mathcal{N} - [(X \cap Y) \cup Z]$. Therefore, $X \cap Y$ separates off $Z$ from $\mathcal{N} - [(X \cap Y) \cup Z]$.
By definition, $X \cap Y \twoheadrightarrow Z$ is in $M^+$.

(6) $\Rightarrow$ (7): Let $X = \{A_1, A_2, \ldots, A_k\}$ and $Y = \{B_1, B_2, \ldots, B_m\}$ be two disjoint
sets with every $A_i$ orthogonal to every $B_j$. Let $Z = \mathcal{N} - XY$, $X_i = X - A_i$ ($i =
1, 2, \ldots, k$), and $Y_j = Y - B_j$ ($j = 1, 2, \ldots, m$). Since every $A_i$ is orthogonal to
every $B_j$, we have $Z X_i Y_j \twoheadrightarrow A_i$ for each $i$ and $j$. Since $M^+$ has the intersection
property, we have $\bigcap \{Z X_i Y_j \mid j = 1, 2, \ldots, m\} \twoheadrightarrow A_i$. But $\bigcap \{Z X_i Y_j \mid j =
1, 2, \ldots, m\} = Z X_i$. Hence, $Z X_i \twoheadrightarrow A_i$, or equivalently by the complementation
rule for GMVDs [12], $Z X_i \twoheadrightarrow Y$ for each $i$. Again from the intersection property,
$\bigcap \{Z X_i \mid i = 1, 2, \ldots, k\} \twoheadrightarrow Y$. Since $\bigcap \{Z X_i \mid i = 1, 2, \ldots, k\} = Z$, we have
$Z \twoheadrightarrow Y$, or equivalently, $Z \twoheadrightarrow X$.

(7) $\Rightarrow$ (5): Suppose that $M^+$ has the orthogonal property, and let $N$ be the set
of GMVDs graph generated from $G(M)$. By Lemma 2, $M^+ \subseteq N$. For the other
inclusion, let $X \twoheadrightarrow Y$ be a GMVD in $N$ and let $Z = \mathcal{N} - XY$. By definition of
$N$, there is no edge in $G(M)$ connecting a node in $Y$ to a node in $Z$. Thus every
attribute of $Y$ is orthogonal to every attribute of $Z$. Then by the orthogonality
property, $Y$ is orthogonal to $Z$, and $X \twoheadrightarrow Y$ is in $M^+$.

(5) $\Rightarrow$ (3): Obvious.

(5) $\Rightarrow$ (4): Let $G$ be the graph that generates $M^+$. Attributes $A_i$ and $A_j$ are split
by $M$ if and only if the edge $(A_i, A_j)$ is not in $G$. But also, $A_i$ and $A_j$ are split
by $M$ if and only if the edge $(A_i, A_j)$ is not in $G(M)$. Therefore, $G = G(M)$.

(4) $\Rightarrow$ (3): Obvious.



References 

1. Beeri, C., Fagin, R., Maier, D., Yannakakis, M.: On the Desirability of
Acyclic Database Schemes. J. ACM 30(3) (1983) 479-513

2. Cooper, G.F.: The Computational Complexity of Probabilistic Inference Using
Bayesian Belief Networks. Artificial Intelligence 42 (1990) 393-402

3. Dagum, P., Luby, M.: Approximating Probabilistic Inference in Bayesian Belief
Networks is NP-hard. Artificial Intelligence 60(1) (1993) 141-153

4. Fagin, R.: Multivalued Dependencies and a New Normal Form for Relational
Databases. ACM Transactions on Database Systems 2(3) (1977) 262-278

5. Hajek, P., Havranek, T., Jirousek, R.: Uncertain Information Processing in Expert
Systems. CRC Press (1992)

6. Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible
Inference. Morgan Kaufmann, San Francisco, California (1988)

7. Shafer, G.: An Axiomatic Study of Computation in Hypertrees. University of
Kansas, School of Business Working Papers (232) (1991)

8. Studeny, M.: Conditional Independence Relations Have No Finite Complete
Characterization. Eleventh Prague Conference on Information Theory, Statistical
Decision Foundation and Random Processes (1990)

9. Wong, S.K.M., Wang, Z.W.: On Axiomatization of Probabilistic Conditional
Independence. Tenth Conference on Uncertainty in Artificial Intelligence (1994)
591-597

10. Wong, S.K.M., Butz, C.J., Xiang, Y.: A Method for Implementing a Probabilistic
Model as a Relational Database. Eleventh Conference on Uncertainty in Artificial
Intelligence (1995) 556-564

11. Wong, S.K.M.: Testing Implication of Probabilistic Dependencies. Twelfth
Conference on Uncertainty in Artificial Intelligence (1996) 545-553

12. Wong, S.K.M.: The Relational Structure of Belief Networks. (submitted for
publication) (1997)

13. Wong, S.K.M.: An Extended Relational Data Model for Probabilistic Reasoning.
Journal of Intelligent Information Systems 9 (1997) 181-202




A New Qualitative Rough-Set Approach to
Modeling Belief Functions



Mieczyslaw A. Klopotek, Slawomir T. Wierzchon

Institute of Computer Science, Polish Academy of Sciences
Warszawa, Poland
e-mail: klopotek,stw@ipipan.waw.pl



Abstract. The paper presents a novel view of the Dempster-Shafer 
belief function as a measure of diversity in relational data bases. The 
Dempster rule of evidence combination corresponds to the join operator 
of the relational database theory. This rough-set based interpretation 
is qualitative in nature and can represent a number of belief function 
operators. 



1 Introduction 

A case-based interpretation of the Mathematical Theory of Evidence (MTE) has
been a hot issue for a long time (compare discussions in [1]). Several models
have been proposed, including rough set theory based ones (see an overview in
[8]). However, none seems to be both complete and intuitively simple [6]. Failure
to achieve this goal is attributed to relying on frequencies [6], but so far a
non-frequency interpretation appears to be missing. Below we present a modification
of a decision-table based rough-set interpretation [5], in which we abandon object
identities. The new measure of support of a decision does not rely on the number
of records containing the decision, but rather on the number of records with distinct
information part. This means that instead of frequencies we use diversity of
support. Such an approach seems to be reasonable in cases where the decision
table does not reflect the actual frequencies of decision situations, but is meant
to present their diversity. The new measure of support will be higher for more
universal types of decisions. The new non-frequency interpretation fulfills the
requirement to be qualitative in nature and still to be case-based. Its appealing
nature is illustrated by some examples.

We assume familiarity with basic concepts of mathematical theory of evi- 
dence [3], rough sets [8], decision tables [5], SQL language, in which the new 
interpretation has been implemented by the authors, and relational databases 
in general [7]. 

Denotation: Let a tuple $\mu$ mean a function $\mu : A \to DOM(A)$, with
$A$ being a set of attributes $A_j$, $DOM(A_j)$ being the domain of the attribute
$A_j$, $DOM(A) = \bigcup_{A_j \in A} DOM(A_j)$. $A$ be called the scheme of $\mu$, $A = S(\mu)$.

A relational table $TAB$ be any set of tuples with identical scheme. This common
scheme be denoted by $S(TAB)$. Let $\mu[R]$ with $R \subseteq S(\mu)$ denote the
restriction of the tuple $\mu$ to the scheme $R$: $\mu[R] = \{(A_j, a_{jk}) \mid A_j \in R \wedge a_{jk} =
\mu(A_j)\}$. The restriction of a relational table $TAB$ to $R$, denoted $TAB[R]$, be
defined $TAB[R] = \{\mu[R] \mid \mu \in TAB\}$. A relational join of two relational tables
$TAB_1$, $TAB_2$ be defined as: $TAB_1 \bowtie TAB_2 = \{\mu_1 \cup \mu_2 \mid \mu_1 \in TAB_1, \mu_2 \in
TAB_2 \wedge \mu_1[S(TAB_1) \cap S(TAB_2)] = \mu_2[S(TAB_1) \cap S(TAB_2)]\}$.

¹ Shafer [1] recalls the fact that MTE was a generalization of legal reasoning rules. In
that case the new interpretation would mean "discarding" witnesses that give
suspiciously similar testimonies and would count those ones that differ on non-essential
details reflecting the usual subjective impressions of individuals.

A decision table is a relational table in which we split the scheme into two 
distinct parts: the information part and the decision part. 

Let card{SET) denote the cardinality of the set SET. 

2 New Interpretation 

Let us define the plausibility $Pl_{TAB}(SET)$ derived from a decision table TAB
with decision variable D and the set I of information variables as: $Pl_{TAB}(SET) =
card(\{\mu[I] \mid \mu \in TAB \wedge \mu(D) \in SET\}) / card(TAB[I])$, implemented as

create view tmpTAB(No) as select count(distinct I) from TAB;

create view plTAB as select count(distinct I)/No from TAB, tmpTAB where
TAB.D in SET;

Example 1 explains the detailed numerical procedure for calculation of Pl from
the above SQL expression.



Table 1. Decision table: BUILD. I - the firm; D - the object to be erected.

No.   I           D
1.    ABD A.G.    center
2.    LQR Inc.    school
3.    PTS Ltd.    center
      PTS Ltd.    restaurant
4.    XYZ Inc.    center
5.    ZZZ Ltd.    restaurant
      ZZZ Ltd.    school



Example 1. Assume a public offering for erection of buildings of a school, a
restaurant and a shopping center where the offers presented by various firms
have been summarized in the decision table BUILD (tab. 1) with "information
part" (I) - the firm and "decision part" (D) - the object to be erected. The
domain of the decision variable D is {center, restaurant, school}. What is the
share of firms that would build either the school or the restaurant? To answer
this question we need to calculate the plausibility Pl({school, restaurant}) from
this table. There are 7 cases (rows) in the dataset. But there are only 5 cases
with distinct information part (firms) I. And there are only 3 cases with decision
either school or restaurant with distinct information part I (LQR Inc., PTS
Ltd., ZZZ Ltd.). So the plausibility is equal² to Pl({school, restaurant}) = 3/5. One
can check that Pl({school}) = 2/5 and Pl({restaurant}) = 2/5.
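The counting in Example 1 can be replayed outside SQL with a short sketch. The table is copied from tab. 1; the function name `pl` is an illustrative assumption.

```python
from fractions import Fraction

# Decision table BUILD from tab. 1: (information part I, decision D).
BUILD = [
    ("ABD A.G.", "center"),
    ("LQR Inc.", "school"),
    ("PTS Ltd.", "center"),
    ("PTS Ltd.", "restaurant"),
    ("XYZ Inc.", "center"),
    ("ZZZ Ltd.", "restaurant"),
    ("ZZZ Ltd.", "school"),
]


def pl(table, subset):
    """Plausibility: distinct information parts with a decision in `subset`,
    divided by the number of distinct information parts."""
    firms = {i for i, _ in table}
    hits = {i for i, d in table if d in subset}
    return Fraction(len(hits), len(firms))


pl_sr = pl(BUILD, {"school", "restaurant"})   # 3/5
pl_s = pl(BUILD, {"school"})                  # 2/5
pl_r = pl(BUILD, {"restaurant"})              # 2/5
```

Counting rows instead of distinct firms would give 4 of 7 for {school, restaurant}, which is the frequency-based value the text contrasts with.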

Theorem 1. The function $Pl_{TAB}(SET)$ derived from a decision table TAB
with decision variable D and the set I of information variables is a plausibility
function $Pl(SET)$ in the sense of Dempster-Shafer theory.

For the proof see [2]. 

Relational views for other MTE measures may also be derived (numerical
illustration of these SQL-based definitions is given in example 2):

Belief $Bel_{TAB}(SET) = 1 - card(\{\mu[I] \mid \mu \in TAB \wedge \mu(D) \notin SET\}) / card(TAB[I])$,
implemented as

create view belTAB as select 1 - count(distinct I)/No from TAB, tmpTAB where
not (TAB.D in SET);

Commonality $Q_{TAB}(SET) = card(\{\mu[I] \mid \forall_{d \in SET}\ \mu[I] \cup \{(D,d)\} \in TAB\})
/ card(TAB[I])$, implemented as

create view tmp1TAB(CN) as select count(distinct D) from TAB
where TAB.D in SET group by I;

create view qTAB as select count(*)/No from tmpTAB, tmp1TAB where
CN=card(SET); (card() is a function counting the elements of the set passed
as its argument)

Basic belief assignment $m_{TAB}(SET) = card(\{\mu[I] \mid \forall_{d \in SET}\ \mu[I] \cup \{(D,d)\} \in
TAB \wedge \forall_{d \notin SET}\ \mu[I] \cup \{(D,d)\} \notin TAB\}) / card(TAB[I])$, implemented as

create view tmp11TAB(I,D) as select TAB.I, TAB.D+XX.D from TAB, TAB
XX where TAB.D in SET and XX.I=TAB.I;

create view tmp12TAB(I,CN) as select I, count(distinct D) from tmp11TAB
group by I;

create view m as select count(*)/No from tmpTAB, tmp12TAB
where CN=card(SET)*card(SET);

Example 2. From tab. 1 we easily calculate that:

Commonality Q({school, restaurant}) = 1/5 (number of firms ready to build both
the school and the restaurant: ZZZ Ltd.).

Belief Bel({school, restaurant}) = 2/5 (number of firms ready to build nothing but
the school or the restaurant: LQR Inc., ZZZ Ltd.).

bpa - number of firms offering erection of exactly: m({school, restaurant}) = 1/5 (ZZZ
Ltd.), m({restaurant}) = 0 (none), m({school}) = 1/5 (LQR Inc.).
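The three measures of Example 2 reduce to subset tests on the set of decisions offered by each firm. The sketch below assumes this reading of the definitions; the helper names are illustrative, and the table is again tab. 1.

```python
from collections import defaultdict
from fractions import Fraction

BUILD = [
    ("ABD A.G.", "center"), ("LQR Inc.", "school"),
    ("PTS Ltd.", "center"), ("PTS Ltd.", "restaurant"),
    ("XYZ Inc.", "center"),
    ("ZZZ Ltd.", "restaurant"), ("ZZZ Ltd.", "school"),
]


def offers(table):
    """Map each distinct information part to the set of its decisions."""
    by_firm = defaultdict(set)
    for i, d in table:
        by_firm[i].add(d)
    return by_firm


def share(table, keep):
    """Fraction of distinct information parts whose decision set satisfies `keep`."""
    by_firm = offers(table)
    return Fraction(sum(1 for s in by_firm.values() if keep(s)), len(by_firm))


def bel(table, subset):      # firms offering nothing outside `subset`
    return share(table, lambda s: s <= set(subset))


def q(table, subset):        # firms offering every element of `subset`
    return share(table, lambda s: s >= set(subset))


def m(table, subset):        # firms offering exactly `subset`
    return share(table, lambda s: s == set(subset))


SR = {"school", "restaurant"}
results = (q(BUILD, SR), bel(BUILD, SR), m(BUILD, SR),
           m(BUILD, {"restaurant"}), m(BUILD, {"school"}))
```

The tuple `results` holds Q, Bel and the three bpa values in the order they are listed in Example 2.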

Theorem 2. The functions $Bel_{TAB}(SET)$, $Q_{TAB}(SET)$, $m_{TAB}(SET)$ derived
from a decision table TAB with decision variable D and the set I of information
variables are belief, commonality, basic probability/belief assignment functions
resp. $Bel(SET)$, $Q(SET)$, $m(SET)$ in the sense of Dempster-Shafer theory.

² Under Skowron/Busse interpretation [5] we get a different Pl value: Pl = 4/7. The
difference stems from the fundamental difference between frequency [5] and relational
(ours) view.

3 Reasoning as Selection of a Subtable

A natural way to understand a belief function $Bel_{TAB}$ conditioned on evidence
B (denoted $Bel_{TAB}(\cdot \| B)$) is to select from TAB records fitting the condition B:
$TAB_B = \{\mu \mid \mu \in TAB \wedge \mu(D) \in B\}$, implemented as

create view TAB_B(I,D) as select I,D from TAB where TAB.D in B;

and to calculate $Bel_{TAB}(\cdot \| B)$ as the belief function $Bel_{TAB\_B}(\cdot)$ over TAB_B.
In our example, $Bel_{BUILD}(\cdot \| \{school, restaurant\})$ is just Bel calculated from
tab. 2. Our conditional belief function matches perfectly Shafer's definition
of $Bel(\cdot \| B) = Bel \oplus Bel_B$.



Table 2. Relational modeling of reasoning in MTE:
Bel(.||{school, restaurant}).
Prescription: select I,D from BUILD where D=school or D=restaurant

No.   I           D
1.    LQR Inc.    school
2.    PTS Ltd.    restaurant
3.    ZZZ Ltd.    restaurant
      ZZZ Ltd.    school

Pl({restaurant}||{school, restaurant}) = 2/3 (PTS Ltd., ZZZ Ltd.)
m({restaurant}||{school, restaurant}) = 1/3 (PTS Ltd.)
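Conditioning as subtable selection can be sketched in the same style: rows whose decision lies outside the evidence B are dropped, and the measures are recomputed on what remains. The function names below are illustrative assumptions.

```python
from collections import defaultdict
from fractions import Fraction

BUILD = [
    ("ABD A.G.", "center"), ("LQR Inc.", "school"),
    ("PTS Ltd.", "center"), ("PTS Ltd.", "restaurant"),
    ("XYZ Inc.", "center"),
    ("ZZZ Ltd.", "restaurant"), ("ZZZ Ltd.", "school"),
]


def condition(table, B):
    """Select the records fitting the condition B (the subtable TAB_B)."""
    return [(i, d) for i, d in table if d in B]


def offers(table):
    by_firm = defaultdict(set)
    for i, d in table:
        by_firm[i].add(d)
    return by_firm


def pl(table, subset):
    by_firm = offers(table)
    return Fraction(sum(1 for s in by_firm.values() if s & set(subset)), len(by_firm))


def m(table, subset):
    by_firm = offers(table)
    return Fraction(sum(1 for s in by_firm.values() if s == set(subset)), len(by_firm))


sub = condition(BUILD, {"school", "restaurant"})   # the rows of tab. 2
pl_cond = pl(sub, {"restaurant"})                  # 2/3
m_cond = m(sub, {"restaurant"})                    # 1/3
```

Firms that offered only the center disappear from the subtable, so the denominator drops from 5 to 3.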



4 Combination as Relational Join 

Let us calculate the table FINISH as a join of BUILD and EQUIP (tab. 3) over
the common column D so that the new decision table has as its decision column
D and as its information part I,I2 (tab. 3).

create table FINISH (I,I2,D);
insert into FINISH
select I,I2,BUILD.D from BUILD, EQUIP where BUILD.D=EQUIP.D;

Notice that in BUILD (tab. 1) there were 5 cases with distinct information part,
in EQUIP (tab. 3) there were 3, and in BUEQ (tab. 3) there are only 10. You can easily
check that $Bel_{FINISH} = Bel_{BUILD} \oplus Bel_{EQUIP}$.
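The claimed equality can be checked numerically. The sketch below computes the basic belief assignment of the joined table and compares it with the normalized Dempster combination of the assignments of BUILD and EQUIP. The function names are illustrative assumptions, and the firm labels follow the EQUIP listing in tab. 3 (only the decision column takes part in the join).

```python
from collections import defaultdict
from fractions import Fraction

BUILD = [
    ("ABD A.G.", "center"), ("LQR Inc.", "school"),
    ("PTS Ltd.", "center"), ("PTS Ltd.", "restaurant"),
    ("XYZ Inc.", "center"),
    ("ZZZ Ltd.", "restaurant"), ("ZZZ Ltd.", "school"),
]
EQUIP = [
    ("AAA GmbH", "school"),
    ("BBB Ltd.", "center"), ("BBB Ltd.", "restaurant"),
    ("CCC Inc.", "center"), ("CCC Inc.", "restaurant"),
]


def mass(rows):
    """Basic belief assignment of a table given as (information part, decision) rows."""
    focal = defaultdict(set)
    for info, d in rows:
        focal[info].add(d)
    m = defaultdict(Fraction)
    for s in focal.values():
        m[frozenset(s)] += Fraction(1, len(focal))
    return dict(m)


def dempster(m1, m2):
    """Normalized Dempster rule: multiply masses, intersect focal sets, drop the empty set."""
    raw = defaultdict(Fraction)
    for a, pa in m1.items():
        for b, pb in m2.items():
            if a & b:
                raw[a & b] += pa * pb
    total = sum(raw.values())
    return {s: p / total for s, p in raw.items()}


# Relational join over the common column D; the information part becomes the pair (I, I2).
FINISH = [((i, i2), d) for i, d in BUILD for i2, d2 in EQUIP if d == d2]

n_pairs = len({info for info, _ in FINISH})
same = mass(FINISH) == dempster(mass(BUILD), mass(EQUIP))
```

Pairs of firms with no object in common never appear in the join, which plays the role of the normalization step in Dempster's rule.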

Theorem 3. If the decision tables DT1(I1,D) and DT2(I2,D) with non-overlapping
information parts I1, I2 are combined by the relational join operation $DT1 \bowtie
DT2$, implemented as

select I1, I2, DT1.D from DT1, DT2 where DT1.D=DT2.D;

yielding table DT12(I,D) with $I = I1 \cup I2$, then $Bel_{DT12} = Bel_{DT1} \oplus Bel_{DT2}$.
For the proof see [2].

5 Relational Marginalization and Decombination

Notice that BUILD and EQUIP in our example are both in first normal form and
the domain of the attribute D is identical in both tables. Therefore marginalization
of FINISH over I,D (select distinct I,D from FINISH;) is exactly identical
with BUILD. In general:



Table 3. Combination of Independent Evidence. Example: Public offering:
equipment for buildings of a school, a restaurant and a shopping center.
Decision table: EQUIP. "Information part" - the firm, "Decision part" - the object
to be equipped.

No.   I2          D
1.    AAA GmbH    school
2.    BBB Ltd.    center
      BBB Ltd.    restaurant
3.    CCC Inc.    center
      CCC Inc.    restaurant

Independence of evidence means that no pair of firms (one from BUILD, one from
EQUIP) refuses to cooperate on erecting and equipping an object. How many pairs of
firms do we have to finish a set of objects mentioned in the offerings? Answer: create
view BUEQ(I,I2,D) as select I, I2, BUILD.D from BUILD, EQUIP where BUILD.D=EQUIP.D,
yielding the table below.

No.   I           I2          D
1.    ABD A.G.    BBB Ltd.    center
2.    ABD A.G.    GGG Inc.    center
3.    LQR Inc.    AAA GmbH    school
4.    PTS Ltd.    BBB Ltd.    center
      PTS Ltd.    BBB Ltd.    restaurant
5.    PTS Ltd.    GGG Inc.    center
      PTS Ltd.    GGG Inc.    restaurant
6.    XYZ Inc.    BBB Ltd.    center
7.    XYZ Inc.    GGG Inc.    center
8.    ZZZ Ltd.    BBB Ltd.    restaurant
9.    ZZZ Ltd.    GGG Inc.    restaurant
10.   ZZZ Ltd.    AAA GmbH    school



Pl({school, restaurant}) = 6/10, Bel({school, restaurant}) = 4/10



Theorem 4. If the information part I of the decision table DT(I,D) can be
split into two such parts I1, I2 that $I1 \cup I2 = I$ and $I1 \cap I2 = \emptyset$ and the relation
DT is identical with $DT1 \bowtie DT2$, implemented

select I1, I2, DT1.D from DT1, DT2 where DT1.D=DT2.D;

where DT1 and DT2 are DT1=DT[I1,D], DT2=DT[I2,D], implemented

create view DT1 as select distinct I1, D from DT;
create view DT2 as select distinct I2, D from DT;

that is there is a multivariate dependency between I1 and I2 given D, then
$Bel_{DT} = Bel_{DT[I1,D]} \oplus Bel_{DT[I2,D]}$.

Proof. Follows directly from theorem 3.

Let us consider the unnormalized MTE measures of decision tables $Bel'_{TAB}$, $Pl'_{TAB}$,
$m'_{TAB}$, $Q'_{TAB}$, such that $f'_{TAB} = f_{TAB} \cdot card(TAB[I])$ (card - number of distinct
rows, f - m or Bel or Pl or Q) and the unnormalized combination operator
$\oplus'$ such that $Bel'_{E_{12}} = Bel'_{E_1} \oplus' Bel'_{E_2}$ is defined as follows:

$m'_{E_{12}}(A) = \sum_{B, C:\, A = B \cap C} m'_{E_1}(B) \cdot m'_{E_2}(C)$.

What may be more surprising, a kind of a reverse theorem holds:

Theorem 5. The information part I of the decision table DT(I,D) can be split
into two such parts I1, I2 that $I1 \cup I2 = I$ and $I1 \cap I2 = \emptyset$ and $Bel'_{DT} =
Bel'_{DT[I1,D]} \oplus' Bel'_{DT[I2,D]}$ if and only if the relation DT is identical with $DT1 \bowtie
DT2$, implemented

select I1, I2, DT1.D from DT1, DT2 where DT1.D=DT2.D;

where DT1 and DT2 are DT1=DT[I1,D], DT2=DT[I2,D], implemented

create view DT1 as select distinct I1, D from DT;
create view DT2 as select distinct I2, D from DT;

that is there is a multivariate dependency between I1 and I2 given D.

For the proof see [2].

6 Multivariate Beliefs 

Our definition of MTE measures is easily extended to tables with multiple
decision variables which may model multivariate belief distributions (in all the
decision variables). It is trivial to see that dropping a decision variable does
not diminish the diversity of the information part. Hence dropping a decision
variable $D_i$ from the set of decision variables D has the same effect as dropping
a variable (so-called marginalization or projection operation $\downarrow$) in the belief
function. That is, for any set B of decision vectors in variables $D - \{D_i\}$:
$Bel_{TAB[I \cup D - \{D_i\}]}(B) = Bel_{TAB}^{\downarrow D - \{D_i\}}(B)$. See tab. 4 for an example.

The operator of projection $\downarrow$ should be understood as the MTE projection
operator applied to a belief function. Let DTM be a decision table with decision
variables D1 and D2. Let the information part consist of two disjoint parts I1
and I2. Let us consider the following views:

create view DTM1 (I1,D1) as select distinct I1, D1 from DTM;
create view DTM2 (I2,D2) as select distinct I2, D2 from DTM;
create view DTM12 as select distinct I1, I2, D1, D2 from DTM1, DTM2;

If now the table DTM12 is relationally identical with DTM, then we shall say
that the decision variables D1 and D2 are independent in the decision table
DTM. It is not surprising that: $Bel_{DTM} = Bel_{DTM}^{\downarrow D1} \oplus Bel_{DTM}^{\downarrow D2}$. This means
that independence of decision variables in a decision table implies independence
of variables in the corresponding belief function. See tab. 5 for an example.

Table 4. Multivariate MTE. A table and its projection

Decision table MADEOF

No.   I           D            D1
1.    ABD A.G.    center       wooden
2.    LQR Inc.    school       stone
      LQR Inc.    school       wooden
3.    PTS Ltd.    center       stone
      PTS Ltd.    center       wooden
      PTS Ltd.    restaurant   stone
4.    XYZ Inc.    center       stone
5.    ZZZ Ltd.    restaurant   stone
      ZZZ Ltd.    restaurant   wooden
      ZZZ Ltd.    school       wooden

Its projection onto I,D:
select distinct I,D from MADEOF

No.   I           D
1.    ABD A.G.    center
2.    LQR Inc.    school
3.    PTS Ltd.    center
      PTS Ltd.    restaurant
4.    XYZ Inc.    center
5.    ZZZ Ltd.    restaurant
      ZZZ Ltd.    school

m({(school, wooden)}) = 1/5, m({(school, stone)}) = 2/5

({(school)}) = m({(school, stone)}) + m({(school, wooden)}) = 1/5



Table 5. Variable Independence

D2 - heating

No.   I2          I3    D            D2
1.    AAA GmbH    EG    school       electric
2.    AAA GmbH    GG    school       gas
3.    BBB Ltd.    EG    center       electric
      BBB Ltd.    EG    restaurant   electric
4.    BBB Ltd.    GG    center       gas
      BBB Ltd.    GG    restaurant   gas
5.    GGG Inc.    EG    center       electric
      GGG Inc.    EG    restaurant   electric
6.    GGG Inc.    GG    center       gas
      GGG Inc.    GG    restaurant   gas

Variables D and D2 are independent ($Bel = Bel^{\downarrow D} \oplus Bel^{\downarrow D2}$) because the above table
represents a cross product of the tables (without common columns)

No.   I2          D
1.    AAA GmbH    school
2.    BBB Ltd.    center
      BBB Ltd.    restaurant
3.    GGG Inc.    center
      GGG Inc.    restaurant

No.   I3    D2
1.    EG    electric
2.    GG    gas

Let DTX be a decision table with decision variables D1, D2 and D3. Let
the information part consist of two disjoint parts I1 and I2. Let us consider the
following views:

create view DTX1 (I1,D1,D3) as select distinct I1,D1,D3 from DTX;
create view DTX2 (I2,D2,D3) as select distinct I2,D2,D3 from DTX;
create view DTX12 as select distinct I1,I2,D1,D2,DTX1.D3 from DTX1, DTX2 where
DTX1.D3=DTX2.D3;

If now the table DTX12 is relationally identical with DTX, then we shall say
that the decision variables D1 and D2 are independent given D3 in the decision
table DTX. It is not surprising that: $Bel_{DTX} = Bel_{DTX1} \oplus Bel_{DTX2}$. But
this means that the variables D1 and D2 are independent given D3 in the belief
function $Bel_{DTX}$ in the sense of Shenoy's VBS. See tab. 6 for an example.
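Both relational tests (independence and independence given D3) amount to checking whether a table is reproduced by the natural join of two of its projections. The sketch below is one way to express this; the function names and the small sample tables are hypothetical and not taken from tab. 5 or tab. 6.

```python
def project(rows, cols):
    """Distinct sub-rows over the given columns (rows are dicts column -> value)."""
    return {tuple(r[c] for c in cols) for r in rows}


def join_reproduces(rows, cols1, cols2):
    """True if rows[cols1] joined with rows[cols2] on their common columns equals rows.

    With no common columns the join is a cross product (plain independence);
    with D3 in both column lists it is the test for independence given D3.
    """
    common = [c for c in cols1 if c in cols2]
    all_cols = list(cols1) + [c for c in cols2 if c not in cols1]
    joined = set()
    for t1 in project(rows, cols1):
        d1 = dict(zip(cols1, t1))
        for t2 in project(rows, cols2):
            d2 = dict(zip(cols2, t2))
            if all(d1[c] == d2[c] for c in common):
                merged = {**d1, **d2}
                joined.add(tuple(merged[c] for c in all_cols))
    return joined == project(rows, all_cols)


# Hypothetical table DTX with columns I1, I2, D1, D2, D3.
DTX = [
    {"I1": "a", "I2": "p", "D1": "x", "D2": 1, "D3": "u"},
    {"I1": "a", "I2": "p", "D1": "x", "D2": 2, "D3": "u"},
    {"I1": "a", "I2": "q", "D1": "y", "D2": 1, "D3": "v"},
]

cond_indep = join_reproduces(DTX, ["I1", "D1", "D3"], ["I2", "D2", "D3"])
indep = join_reproduces(DTX, ["I1", "D1"], ["I2", "D2"])
```

In this sample D1 and D2 are independent given D3, but not unconditionally: the cross product of the two projections without D3 contains six rows, while the table has only three.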



Table 6. Conditional Variable Independence

I4 - painting company, D4 - color, D5 - finish.

No.   I2          I4       D            D5          D4
1.    AAA GmbH    Messer   school       wood        green
      AAA GmbH    Messer   school       wood        red
      AAA GmbH    Messer   school       plastic     green
      AAA GmbH    Messer   school       plastic     red
2.    BBB Ltd.    Messer   center       metallic    white
      BBB Ltd.    Messer   center       metallic    yellow
      BBB Ltd.    Messer   center       marble      white
      BBB Ltd.    Messer   center       marble      yellow
3.    BBB Ltd.    Gabel    restaurant   wood        red
4.    GGG Inc.    Messer   center       metallic    white
      GGG Inc.    Messer   center       metallic    yellow
      GGG Inc.    Messer   center       laminated   white
      GGG Inc.    Messer   center       laminated   yellow
5.    GGG Inc.    Gabel    restaurant   plastic     red

In Bel of the above table variables D4 and D5 are conditionally independent given D
because the above table is a relational join of the tables below (with D as a common
column)

No.   I2          D            D5
1.    AAA GmbH    school       wood
      AAA GmbH    school       plastic
2.    BBB Ltd.    center       metallic
      BBB Ltd.    center       marble
      BBB Ltd.    restaurant   wood
3.    GGG Inc.    center       metallic
      GGG Inc.    center       laminated
      GGG Inc.    restaurant   plastic

No.   I4       D            D4
1.    Gabel    restaurant   red
2.    Messer   school       green
      Messer   school       red
      Messer   center       white
      Messer   center       yellow

On Stability of Oja Algorithm

Abstract. By elementary tools of matrix analysis, we show that the
discrete dynamical system defined by Oja algorithm is stable in the ball
A(0, 81/64) if only gains $\beta_n$ are bounded by $(2B)^{-1}$, where the constant $B$ is
determined by $b$, the bound for the learning sequence. We also define a general class of
Oja's systems (with gains satisfying stochastic convergence conditions)
which tend to infinity at an exponential rate if their initial states
are chosen too far from the zero point.



1 Introduction 

Oja algorithm ([6]) is a neural type iterative scheme

$w_{n+1} = w_n + \beta_n f(x_n, w_n)$ (1)

used for stochastic approximation of a principal vector $w \in \mathbb{R}^N$ for the given
random variable $X : \Omega \to \mathbb{R}^N$ with zero mean value ($E[X] = 0$).

The sample $x_n \in \mathbb{R}^N$ is a value of $X$ chosen at the n-th moment of discrete
time, randomly and independently of other samples. The real positive number
$\beta_n$ is called the gain or the learning rate coefficient at time n. The function
$f(x, w)$ has the following form:

$f(x, w) = y(x - yw)$ (2)

where $y = x^T w$. The vectors are written in the column form and $^T$ denotes the
matrix transposition. Components of the vector $w$ are also called weights as they
can be interpreted as weights of one linear neuron with output $y = x^T w$. This
neuron is taught to maximize the variance of the random variable $Y = X^T w$.
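A single iteration of scheme (1) with the function (2) can be written in a few lines. The sketch below uses plain Python lists; the function name `oja_step`, the constant gain and the sample sequence are assumptions chosen for illustration and say nothing about the stability bounds discussed in this paper.

```python
def oja_step(w, x, beta):
    """One update w + beta * f(x, w), with f(x, w) = y * (x - y * w) and y = x^T w."""
    y = sum(xi * wi for xi, wi in zip(x, w))
    return [wi + beta * y * (xi - y * wi) for xi, wi in zip(x, w)]


# Single step: y = 1, so f = (x - w) = [0, 2] and the new weights are [1.0, 0.2].
w1 = oja_step([1.0, 0.0], [1.0, 2.0], 0.1)

# Repeated presentation of a small zero-mean sample set (hypothetical data).
samples = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
w = [0.6, 0.8]
for n in range(2000):
    w = oja_step(w, samples[n % len(samples)], 0.05)
```

Each step needs only the current sample, which is the incremental character of the method: no covariance matrix is ever formed.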

Contrary to numerical algorithms for eigenvector computation, the Oja al- 
gorithm does not require an estimate of the covariance matrix C - the principal 
vector is modified incrementally after receiving a new data vector x. This adap- 
tivity to current input data is the basic reason for the widespread use of this 
approach. 

The algorithm is trivial to implement and the results are accurate enough
for such applications as image compression and pattern recognition (for instance

The work was sponsored in part by University (Rector) priority grant New materials. 

L. Polkowski and A. Skowron (Eds.): RSCTC’98, LNAI 1424, pp. 354—360, 1998. 

© Springer-Verlag Berlin Heidelberg 1998
the classification of materials based on textures extracted from digital images of
material surfaces).

The only drawback of this approach is the need for long training sequences.
For instance, to find a useful approximation of the principal subspace in
handwritten digit recognition we had to use about 500 training examples per
digit. Using only 50 examples reduced the recognition rate from 99.4% to 96.5%
([7]). In other words, we need not only a representative sequence for the
random variable, but this sequence must also be presented several times to
reach adequate accuracy.

The iterative scheme (1) with the function f defined by (2) is also known
as Oja's learning rule. Its convergence was shown by reducing the difference
equation to the corresponding ordinary differential equation. Namely, using the
famous theorem of Kushner and Clark [4], many authors (e.g. [1]) prove
stochastic convergence with probability one (a.s., almost surely) to the
principal vector of X under the following conditions on the gains:

    $\sum_n \beta_n = \infty, \qquad \sum_n \beta_n^2 < \infty$    (3)

Additionally, it is required that the covariance matrix $C = E[XX^T]$ of X has
a principal vector $e_1$ with eigenvalue $\lambda_1$ of multiplicity one
(i.e. $\lambda_1 > \lambda_2$) and that $e_1$ is not orthogonal to the initial
weight vector $w_0$ (i.e. $w_0^T e_1 \ne 0$).

It is interesting that in their proofs all authors using stochastic
approximation ignored a basic condition of this theory: the boundedness of the
stochastic sequence $w_n$. Therefore the proofs were incomplete.

Recently there have been some efforts to find a proof independent of the
Kushner and Clark stochastic approximation theory, using only general facts
from martingale theory. The work of Duflo [2] includes such a proof. She has
shown a.s. convergence of the Oja algorithm, but it seems that the overhead of
the theorems it is based on is comparable with the previous approach. She has
also proved the stability of the algorithm by showing that any trajectory of
Oja's dynamical system beginning in the ball K(0, 5.0) (with center at zero
and radius 5) stays within this ball provided the sequence $\beta_n$ is
bounded by $(2B)^{-1}$, where $B = b^2$ and b is the bound on the learning
sequence:

    $b \triangleq \max_n \|x_n\|, \qquad B = \max_n \|x_n\|^2$    (4)

In this work we improve the above result to a ball of radius about four
times smaller. Namely, we show in Section 2 that the Oja algorithm is stable in
the ball K(0, 81/64).

2 Stability of Oja algorithm 

Let us consider the Oja algorithm in the following classical form, which
follows from (1) and (2) by eliminating the variable y:

    $w_{n+1} = w_n + \beta_n x_n^T w_n \left( x_n - (x_n^T w_n) w_n \right)$    (5)

We obtain an equivalent form by multiplying the first term in the parentheses
by $x_n^T w_n$ from the right and the second one by $w_n^T x_n$ from the left:

    $w_{n+1} = w_n + \beta_n \left( x_n x_n^T w_n - w_n^T x_n x_n^T w_n w_n \right)$    (6)

Extracting $w_n$ by introducing the identity matrix $I \in \mathbb{R}^{N \times N}$
we get:

    $w_{n+1} = w_n + \beta_n \left( x_n x_n^T - w_n^T x_n x_n^T w_n I \right) w_n$    (7)

Denoting the matrix $x_n x_n^T$ by $C_n$ we have:

    $w_{n+1} = w_n + \beta_n \left( C_n - w_n^T C_n w_n I \right) w_n$    (8)

Using the equality $\|w\|^2 = w^T w$ we obtain:

    $\|w_{n+1}\|^2 = \|w_n\|^2 + 2\beta_n (1 - \|w_n\|^2) w_n^T C_n w_n + \beta_n^2 w_n^T C_n^2 w_n + \beta_n^2 (\|w_n\|^2 - 2)(w_n^T C_n w_n)^2$    (9)

Assuming the bounds defined in (4), by simple manipulations we get the
following useful facts:

    $(w_n^T C_n w_n)^2 \le (w_n^T C_n^2 w_n) \|w_n\|^2$,    (10)

    $w_n^T C_n^2 w_n = \|x_n\|^2 (w_n^T C_n w_n) \le B \cdot w_n^T C_n w_n$,    (11)

    $w_n^T C_n w_n \le \|x_n\|^2 \|w_n\|^2 \le B \cdot \|w_n\|^2$.    (12)

The following theorem gives us a tight bound on the stability region of Oja’s 
dynamical system. 

Theorem 1. 

If for each $n \ge 0$ $\|x_n\|^2 \le B$ a.s. and $\beta_n \le \frac{1}{2B}$, then for
each $n \ge 0$ $\|w_n\|^2 \le 81/64$ a.s., provided $\|w_0\|^2 \le 81/64$.

Proof: We show the thesis by induction on n. In the inductive step we consider
three cases:

1. $\|w_n\|^2 \le 1$:

In formula (9) we ignore the last term, which is negative; next we apply
the inequalities (11) and (12) to the third term, the inequality (12) to the
second one, and finally the bound on $\beta_n$:

    $\|w_{n+1}\|^2 \le \|w_n\|^2 + 2\beta_n (1 - \|w_n\|^2) w_n^T C_n w_n + \beta_n^2 w_n^T C_n^2 w_n$

    $\le \|w_n\|^2 + 2\beta_n (1 - \|w_n\|^2) \|w_n\|^2 B + \beta_n^2 B^2 \|w_n\|^2$

    $\le \|w_n\|^2 + (1 - \|w_n\|^2) \|w_n\|^2 + \frac{1}{4} \|w_n\|^2 < 81/64$

The last line above was obtained by using the assumption $\beta_n B \le 1/2$ and
computing the maximum value of the polynomial t + (1 - t)t + 0.25t in the
interval [0, 1].
2. $1 < \|w_n\|^2 \le 1 + \frac{1}{4}$:

This time in formula (9) we ignore the last term, which is negative, and
next we apply the inequality (11) to the third term only:

    $\|w_{n+1}\|^2 \le \|w_n\|^2 + 2\beta_n (1 - \|w_n\|^2) w_n^T C_n w_n + \beta_n^2 w_n^T C_n^2 w_n$

    $\le \|w_n\|^2 + 2\beta_n (1 - \|w_n\|^2) w_n^T C_n w_n + \beta_n^2 B w_n^T C_n w_n$    (13)

    $= \|w_n\|^2 + 2\beta_n w_n^T C_n w_n \left( 1 - \|w_n\|^2 + \frac{\beta_n B}{2} \right)$

Applying the bound for $\beta_n B$ we get:

    $\|w_{n+1}\|^2 \le \|w_n\|^2 + \|w_n\|^2 \left( \frac{5}{4} - \|w_n\|^2 \right) = \|w_n\|^2 \left( \frac{9}{4} - \|w_n\|^2 \right) \le \left( \frac{9}{8} \right)^2 = \frac{81}{64}$

The bound 81/64 is the maximum value of the polynomial t(9/4 - t). Hence
the upper bound 81/64 holds.

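3. $1 + \frac{1}{4} < \|w_n\|^2 \le 81/64$:

As 81/64 is less than 2, we can use the last expression in (13), in which we
can drop the nonpositive second term:

    $\|w_{n+1}\|^2 \le \|w_n\|^2 + 2\beta_n w_n^T C_n w_n \left( 1 - \|w_n\|^2 + \frac{\beta_n B}{2} \right) \le \|w_n\|^2 \le \frac{81}{64}$    (14)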
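
The bound of Theorem 1 can be probed numerically. The sketch below is an illustration, not part of the original proof: it draws bounded random samples, uses the constant gain 1/(2B) and starts from a vector with squared norm 81/64, then reports the largest squared norm seen along the trajectory. The helper names, the sampling scheme and the default parameters are assumptions made here.

```python
import math
import random


def oja_step(w, x, beta):
    """One Oja update w + beta * y * (x - y * w) with y = x^T w."""
    y = sum(xi * wi for xi, wi in zip(x, w))
    return [wi + beta * y * (xi - y * wi) for xi, wi in zip(x, w)]


def random_vector(rng, dim, length):
    """Random direction (Gaussian components) rescaled to the given length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    scale = length / math.sqrt(sum(c * c for c in v))
    return [c * scale for c in v]


def max_squared_norm(n_steps, dim=3, B=4.0, seed=0):
    """Largest ||w_n||^2 observed when ||x_n||^2 <= B and beta_n = 1/(2B)."""
    rng = random.Random(seed)
    b = math.sqrt(B)                       # bound on ||x_n||
    beta = 1.0 / (2.0 * B)                 # largest gain allowed by Theorem 1
    w = random_vector(rng, dim, 9.0 / 8.0)  # ||w_0||^2 = 81/64
    worst = sum(c * c for c in w)
    for _ in range(n_steps):
        x = random_vector(rng, dim, rng.uniform(0.0, b))
        w = oja_step(w, x, beta)
        worst = max(worst, sum(c * c for c in w))
    return worst
```

A call such as `max_squared_norm(2000)` should never return a value above 81/64, up to floating point rounding.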



As a byproduct of the above proof we get a corollary about the monotonic
behavior of the norm $\|w_n\|$ in certain regions:

Corollary 1.

1. If $1 + \frac{1}{4} \le \|w_n\|^2 \le 81/64$ then $\|w_n\| \ge \|w_{n+1}\|$;

2. If $\|w_n\| \le 1$ then $\|w_n\| \le \|w_{n+1}\|$.

Proof: The first fact is included in the inequality (14). The second requires
more careful bounding of the right side of equation (9). First we break the
factor $\|w_n\|^2 - 2$ into $\|w_n\|^2 - 1$ and $-1$. Multiplying, substituting
the relation (10) and reordering the terms produces:



IhniP 


+ 2/?„(l 


- ikip: 


)w'^CnWn-\- 


+/^n(l 


- IKIP 


)wlClw 


n /^n(^ ll^nll 


IkniP 


+ 2/?„(l 


- llw'niP’ 


)w'^CnWn + /?^(1 ■ 


-m 


- IKf 


)BwlCn 


||n^n II 


IknlP 


+ /^n(l “ 


- llw'nlHl 




2/?n(l 


“ l^^nlP) 







\Wn\\^)w'^Clwn- 



2 
The last inequality is valid as all remaining terms are nonnegative. In particular,
$B\beta_n \|w_n\|^2 \le 1/2$, which implies positivity of the factor $1 - B\beta_n \|w_n\|^2 / 2$.

□ 



3 Unbounded trajectories in Oja scheme 



In this section we construct a general class of Oja's dynamical systems, defined
by the recurrence (5), which exhibits very fast divergence to infinity
although the training sequence is bounded.

Let $e \in \mathbb{R}^N$ be a unit length vector and let A, B be two real
constants with $0 < A < B < 1$. Assume that for each $n \ge 0$,
$x_n \in \mathrm{span}(e)$ and $A \le \|x_n\|^2 \le B$. Assume also that
$w_0 \in \mathrm{span}(e)$. Then obviously $w_n \in \mathrm{span}(e)$ for each
$n \ge 0$, too. As $x_n$ and $w_n$ are collinear, we have immediately:

    $w_n^T C_n w_n = (x_n^T w_n)^2 = \|w_n\|^2 \|x_n\|^2$    (15)



Let us assume gain coefficients of the form

    $\beta_n = \frac{1}{2B + n} \le \frac{1}{2B}$    (16)

We see that $\beta_n$ satisfies the assumption of the stability Theorem 1.

Denote the difference $\|w_{n+1}\|^2 - \|w_n\|^2$ by $\Delta\|w_n\|^2$. Then from
equation (9) we have:

    $\Delta\|w_n\|^2 = 2\beta_n w_n^T C_n w_n - 2\beta_n \|w_n\|^2 w_n^T C_n w_n + \beta_n^2 w_n^T C_n^2 w_n + \beta_n^2 (\|w_n\|^2 - 2)(w_n^T C_n w_n)^2$    (17)

By dropping the first and the third nonnegative terms and next using
equation (15) and the bounds for the sequence $x_n$, we get a lower bound for
$\|w_n\|^2 \ge 2$:

    $\Delta\|w_n\|^2 \ge -2\beta_n \|w_n\|^4 \|x_n\|^2 + \beta_n^2 (\|w_n\|^2 - 2) \|w_n\|^4 \|x_n\|^4$

    $\ge -2\beta_n \|w_n\|^4 B + \beta_n^2 (\|w_n\|^2 - 2) \|w_n\|^4 A^2$

    $= \beta_n \|w_n\|^4 \left( -2B + \beta_n A^2 \|w_n\|^2 - 2\beta_n A^2 \right)$

Hence we get the lower bound $\Delta\|w_n\|^2 \ge \|w_n\|^2$ provided

    $\beta_n \|w_n\|^2 \ge 1$ and $-2B + \beta_n A^2 \|w_n\|^2 - 2\beta_n A^2 \ge 1$.    (18)

The second condition is equivalent to

    $\|w_n\|^2 \ge \frac{1 + 2B + 2\beta_n A^2}{\beta_n A^2}$    (19)

Note that the assumed condition A < 1 implies that the right side of the
inequality (19) is greater than $1/\beta_n$. It implies the first condition
$\beta_n \|w_n\|^2 \ge 1$. Therefore to have $\|w_{n+1}\|^2 \ge 2\|w_n\|^2$ it is
enough to satisfy the condition (19).
We have to show that this condition is true for all $n \ge 0$ if it is true for
n = 0. The inductive step assumes that

    $\|w_{n+1}\|^2 \ge 2\|w_n\|^2$ and $\|w_n\|^2 \ge \frac{1 + 2B + 2\beta_n A^2}{\beta_n A^2}$

Hence

    $\|w_{n+1}\|^2 \ge 2(2B + n) \frac{1 + 2B + 2\beta_n A^2}{A^2} \ge (2B + n + 1) \frac{1 + 2B + 2\beta_{n+1} A^2}{A^2} = \frac{1 + 2B + 2\beta_{n+1} A^2}{\beta_{n+1} A^2}$

Therefore both relations

    $\|w_{n+1}\|^2 \ge 2\|w_n\|^2$ and $\|w_n\|^2 \ge \frac{1 + 2B + 2\beta_n A^2}{\beta_n A^2}$

are true for all $n \ge 0$.

In conclusion, choosing $w_0 = \alpha e$ where

    $\alpha^2 \ge \frac{2B + 4B^2 + 2A^2}{A^2}$

we get

    $\|w_n\| \ge 2^{n/2} \alpha$.

This proves the unboundedness of the weight sequence in the Oja algorithm.
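
The construction can be illustrated with a short simulation. Because all vectors lie in span(e), it is enough to track the scalar coordinate of $w_n$ along e, for which the recurrence (5) reduces to a one-dimensional update. The sketch below assumes the gains (16) and squared sample norms drawn uniformly from [A, B]; the function name, the uniform sampling and the parameter values in the usage note are illustrative assumptions, not taken from the paper.

```python
import random


def divergent_trajectory(alpha, A, B, n_steps, seed=0):
    """Squared norms ||w_n||^2 for w_0 = alpha * e and x_n collinear with e."""
    rng = random.Random(seed)
    omega = alpha                      # coordinate of w_n along e
    squared_norms = [omega * omega]
    for n in range(n_steps):
        beta = 1.0 / (2.0 * B + n)     # gains of the form (16)
        c = rng.uniform(A, B)          # c = ||x_n||^2, kept inside [A, B]
        # Scalar form of (5): x_n = xi*e, w_n = omega*e, xi^2 = c
        omega = omega + beta * c * omega * (1.0 - omega * omega)
        squared_norms.append(omega * omega)
    return squared_norms
```

For example, with A = 0.5, B = 0.9 and alpha = 5 (so that alpha^2 = 25 exceeds (2B + 4B^2 + 2A^2)/A^2 = 22.16), every step at least doubles the squared norm. Only a handful of steps should be simulated, because the values grow so fast that double precision overflows soon afterwards.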



4 Conclusions 

By elementary tools of matrix analysis, we have shown that the discrete
dynamical system associated with the Oja algorithm is stable in the ball
K(0, 81/64) provided the gains $\beta_n$ are bounded by $(2B)^{-1}$, where
$B = b^2$ and b is the bound for the learning sequence. We have also defined a
general class of Oja's systems (with gains satisfying the stochastic
convergence conditions) which tend to infinity at an exponential rate if their
initial states are chosen too far from the zero point.

Assuming that zero vectors are skipped in the training random sequence, the
above results imply the following heuristic rule for the initial choice of the
weight in the Oja algorithm:

    $w_0 = \frac{x_0}{\|x_0\|}$.




R. Sikora, W. Skarbek



References 

1. Diamantaras, K.I., Kung, S.Y. (1995) Principal component neural networks - Theory
and applications, John Wiley & Sons, Inc.

2. Duflo, M. (1997) Random iterative models, Appl. Math., 34, Springer.

3. Karhunen, J. (1994) Stability of Oja's PCA subspace rule, Neural Computation.

4. Kushner, H.J., Clark, D.S. (1978) Stochastic approximation for constrained and
unconstrained systems, Appl. Math. Sci., 26, Springer.

5. Kushner, H.J., Yin, G. (1997) Stochastic approximation and applications, Springer.

6. Oja, E. (1982) A simplified neuron model as a principal component analyzer, J.
Math. Biology, 15, 267-273.

7. Skarbek, W., Ignasiak, K. (1997) Handwritten digit recognition by local principal
components analysis, ISMIS'97, International Symposium for Methodology of
Intelligent Systems, 217-226, Charlotte, USA, October 1997.




Transform Vector Quantization of Images 
in One Dimension 



Remigiusz J. Rak 

Warsaw University of Technology, PL Politechniki 1, 
PLOO-660, Warsaw, Poland. 



Abstract. This paper describes a hybrid system for the compression of
black and white images (256 by 256 pixels). The system includes the
following procedures: image decomposition (8x8, 16x16 blocks), DCT
transformation, "zig-zag" scanning, product code vector quantization
(one-dimensional block) and bit allocation. The standard LBG algorithm
for codebook design has been enriched with a simulated annealing
procedure for avoiding local minima. Standard vector quantization in
two-dimensional transform space has been investigated for comparison.



1 Introduction 

Compression of 2D signals is the most popular application of vector
quantization. A natural way to apply basic VQ to images directly is to
decompose a sampled image into rectangular 2D blocks of fixed size (M by N
pixels) and then use these blocks as vectors. A much simpler analysis could
seemingly be done by moving along rows (horizontally), reading pixel values and
then using ready, verified algorithms in 1D space. But in that case 2D
intercorrelations between pixels would be lost irreversibly. After moving into
the transform (frequency) domain, where the transform coefficients are much
less correlated, such a procedure seems to be quite reasonable. However,
because of the very characteristic amplitude distribution of the coefficients
(in the plane), "zig-zag" scanning is preferred. A good example is the JPEG
standard (DCT, "zig-zag", SQ), where the property of energy compaction was
first successfully exploited in practice. Investigations carried out by the
author have shown that an efficient hybrid method for image compression should
consist of image decomposition, 2D transformation, "zig-zag" scanning of the
coefficients, product code vector quantization and bit allocation.
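
The "zig-zag" scanning step can be sketched as follows. The ordering used here is the one familiar from JPEG (anti-diagonals traversed in alternating directions, starting from the DC coefficient); the paper does not spell out its exact ordering, so this choice and the function names are assumptions made for illustration.

```python
def zigzag_indices(rows, cols):
    """(row, col) pairs of a rows x cols block in JPEG-style zig-zag order."""
    def key(pos):
        i, j = pos
        s = i + j                      # index of the anti-diagonal
        # Odd diagonals run top-to-bottom, even ones bottom-to-top.
        return (s, i if s % 2 else j)
    return sorted(((i, j) for i in range(rows) for j in range(cols)), key=key)


def zigzag_scan(block):
    """Flatten a 2D block (list of rows) into a 1D list in zig-zag order."""
    return [block[i][j] for i, j in zigzag_indices(len(block), len(block[0]))]
```

After the scan, low-frequency coefficients come first, so the 1D sequence can be cut into consecutive sub-vectors that are quantized separately.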

2 System Description 

Independent coding of the amplitude and phase of the spectrum in the case of
the DFT, effectively implemented in speech coding (phase degradation) [12],
does not bring good results in picture coding. On the contrary, a higher
precision is needed in the quantization of the phase coefficients. Therefore
the investigations have been restricted to the DCT domain. Two versions of the
coding system have been considered: the first in the two-dimensional and the
second in the one-dimensional space of the frequency domain. Simplified block
diagrams of the two versions of the system are presented in figures 1 and 2,
respectively. The symbol Z denotes "zig-zag" scanning.



[Block diagram: DCT-2D -> Y -> VQ_2D -> VQ_2D^{-1} -> Y' -> DCT-2D^{-1} -> X']

Fig. 1. The block diagram of the system: DCTVQ-2D







[Block diagram: DCT-2D -> Y -> Z -> VQ_1D -> VQ_1D^{-1} -> Z^{-1} -> DCT-2D^{-1}]

Fig. 2. The block diagram of the system: DCTVQ-1D



2.1 Analytical Description 

Let us assume that x(m, n) denotes a two-dimensional data array (real numbers)
of size MxN. The two-dimensional pair of orthogonal transforms (forward and
inverse) is defined as:

    $y(k, l) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x(m, n) \, \mathrm{kern}_f(k, l, m, n)$

    $x(m, n) = \sum_{k=0}^{M-1} \sum_{l=0}^{N-1} y(k, l) \, \mathrm{kern}_i(k, l, m, n)$    (1)

for k, m = 0, 1, ..., M - 1 and n, l = 0, 1, ..., N - 1.

In place of the kernels it is necessary to introduce proper trigonometric
expressions. For the Discrete Cosine Transform:

    $\mathrm{kern}_f(k, l, m, n) = \frac{2\alpha(k)\alpha(l)}{\sqrt{MN}} \cos\frac{(2m+1)k\pi}{2M} \cos\frac{(2n+1)l\pi}{2N}$    (2)

where: $\alpha(i) = \frac{1}{\sqrt{2}}$ for i = 0, $\alpha(i) = 1$ for the others, and $\mathrm{kern}_f = \mathrm{kern}_i$.
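
A direct (unoptimized) evaluation of the transform pair with the DCT kernel can be sketched as below. It assumes the orthonormal normalization 2α(k)α(l)/sqrt(MN), which makes the forward and inverse kernels identical; the paper itself uses a faster FFT-based procedure, so this is only a reference sketch with names chosen here.

```python
import math


def _alpha(i):
    """Scaling factor alpha(i): 1/sqrt(2) for i = 0, 1 otherwise."""
    return 1.0 / math.sqrt(2.0) if i == 0 else 1.0


def _kernel(k, l, m, n, M, N):
    """DCT kernel shared by the forward and the inverse transform."""
    return (2.0 * _alpha(k) * _alpha(l) / math.sqrt(M * N)
            * math.cos((2 * m + 1) * k * math.pi / (2 * M))
            * math.cos((2 * n + 1) * l * math.pi / (2 * N)))


def dct2(x):
    """Forward 2D DCT of an M x N block given as a list of rows."""
    M, N = len(x), len(x[0])
    return [[sum(x[m][n] * _kernel(k, l, m, n, M, N)
                 for m in range(M) for n in range(N))
             for l in range(N)] for k in range(M)]


def idct2(y):
    """Inverse 2D DCT; uses the same kernel as the forward transform."""
    M, N = len(y), len(y[0])
    return [[sum(y[k][l] * _kernel(k, l, m, n, M, N)
                 for k in range(M) for l in range(N))
             for n in range(N)] for m in range(M)]
```

The cost of this direct form grows as (MN)^2 per block, which is acceptable for checking results on 8x8 or 16x16 blocks but not for production use.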



The procedures implemented for computing the transform were based on a
modification of the well-known FFT algorithm. Standard 256x256 pixel images
encoded with the resolution r = 8 bpp were used: bird, cat, couple, face, hat,
jet and peppers for system design (training sequence), and baboon and lena for
checking the system parameters. Before the transformation the entire image was
decomposed into data blocks of size 2x2, 4x4, 8x8 or 16x16. In the case of the
small blocks (2x2, 4x4 and 8x8) a direct vector quantization in 2D space was
implemented. In the case of the larger block sizes (8x8 and 16x16) the total
numbers of pixels are 64 and 256, respectively. Therefore a coding procedure
with the data blocks divided after "zig-zag" scanning, in 1D space, was used.
The last system (16x16) was fully compatible with the one applied to speech
signals [12]. The conversion of the obtained data to the form suitable for the
implemented VQ algorithms was done by using a logarithmic compandor (with an
experimentally chosen curve).



2.2 Codebook Design 

For the codebook design the LBG algorithm (based on the GLA) equipped with a
simulated annealing procedure was used. The necessary conditions for
optimality of the quantizer (iterative Lloyd algorithm) provide the basis
for iteratively improving a given codebook. The Lloyd iteration begins with a
vector quantizer that already has a codebook and the corresponding optimal
nearest neighbor partition. Thus the first problem to be solved is to obtain
an initial codebook. There is a variety of techniques for generating an
initial codebook. The simplest conceptual approach to filling a codebook
of N code words is to select the code words randomly according to the source
distribution, which can be viewed as a Monte Carlo codebook design. When a
training sequence is available, it is possible simply to select the first N
training vectors or N randomly chosen ones. The Shannon source coding theorems
imply that such a random selection will on average yield a good code.

The iterations start with an initial codebook, which in the consecutive steps
is denoted by $C_m$. Then, according to the nearest neighbor rule, the entire
training sequence is divided into partitions $R_i$:

    $R_i = \{ y \in R : d(y, c_i) \le d(y, c_j) \text{ for } j \ne i \}$    (3)

where:

    $d(y, c) = \sum_{m=1}^{M} \sum_{n=1}^{N} (y_{m,n} - c_{m,n})^2$    (4)

After that the centroids of each partition, the components of a new codebook
$C_{m+1}$, are computed:

    $c_i = \mathrm{cent}(R_i) = \frac{1}{|R_i|} \sum_{k=1}^{|R_i|} y_k$    (5)

The value of the average global error (for coding the whole training sequence)
is monitored in each iteration step:

    $D_m = \frac{1}{|R|} \sum_{i=1}^{|C|} \sum_{k=1}^{|R_i|} d(y_k, c_i)$    (6)

where |C| denotes the codebook size and $|R_i|$ the size of the i-th partition
of the training sequence.
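
One Lloyd iteration, as described by the nearest neighbor rule, the centroid computation and the average distortion, can be sketched as follows. Vectors are flat lists of floats, the distortion is averaged over the whole training sequence (an assumption about the normalization), and an empty partition keeps its old code vector, which is a convention chosen here rather than something stated in the paper.

```python
def sq_dist(y, c):
    """Squared Euclidean distance between a training vector and a code vector."""
    return sum((yi - ci) ** 2 for yi, ci in zip(y, c))


def nearest(y, codebook):
    """Index of the nearest code vector (ties go to the lowest index)."""
    return min(range(len(codebook)), key=lambda i: sq_dist(y, codebook[i]))


def lloyd_step(codebook, training):
    """Return (new codebook, average distortion measured for the old codebook)."""
    cells = [[] for _ in codebook]
    total = 0.0
    for y in training:
        i = nearest(y, codebook)
        cells[i].append(y)
        total += sq_dist(y, codebook[i])
    new_codebook = []
    for c, cell in zip(codebook, cells):
        if cell:
            # Centroid: component-wise mean of the vectors in the partition.
            new_codebook.append([sum(col) / len(cell) for col in zip(*cell)])
        else:
            new_codebook.append(list(c))   # empty partition: keep old vector
    return new_codebook, total / len(training)
```

Calling `lloyd_step` repeatedly until the distortion stops decreasing gives the plain (non-annealed) codebook design loop.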

The Lloyd procedure is a descent algorithm, meaning that each iteration
reduces the average distortion (or leaves it unchanged), but the new codebook
is generally not very different from the old one. Thus, once an initial
codebook is chosen, the algorithm leads to the nearest local minimum. Since a
cost function generally has many local minima, better or worse, it is clear
that the GLA is not able to locate an optimal codebook. There are some methods
for reducing or eliminating the dependence of the solution on the initial
codebook. In order to avoid local minima in the codebook design procedure, in
this project a simulated annealing procedure with the reaction to the decoder
(SA-D) was used. Perturbations (noise) were introduced to each codebook
component in each iteration step:



cr = c- + ^Tm) (7) 

The parameter called "temperature" ($T_m$) refers to the noise variance, and
the "cooling schedule" refers to the sequence of "temperature" values as a
function of the iteration number m.



Trr 



al = 



^ M 



ly 



( 8 ) 



where: 

" beginning value of the noise variance averaged for all vector components, 

- current value of the noise variance averaged for all vector components, 
rn - iteration number, 
p - integer value of power. 

The LBG algorithm for the training sequence including a simulated annealing
process (SA-D type) is presented below:

1. Start with an initial codebook $C = C_0$, m = 0.

2. Traditional Lloyd iteration step: $C_m \to C_{m+1}$.

3. Introducing perturbations to code vectors: $T_m$.

4. Global average distortion computing for $C_{m+1}$.

5. A comparison of the error reduction level $\Delta D / D$ to the threshold
value $\varepsilon$: $\Delta D / D < \varepsilon$? Decision: stop or go to step 2.
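
The five steps above can be sketched as a single loop. The cooling schedule used below, T_m = t0 * (1 - m / max_iter)^p, is a hypothetical choice made for this sketch, as are the Gaussian noise model, the use of the absolute relative change in step 5 and all function and parameter names; the schedule actually used in the project is not reproduced here.

```python
import random


def sq_dist(y, c):
    """Squared Euclidean distance between two vectors."""
    return sum((yi - ci) ** 2 for yi, ci in zip(y, c))


def partition_and_distortion(codebook, training):
    """Nearest neighbor partition and average distortion for a codebook."""
    cells = [[] for _ in codebook]
    total = 0.0
    for y in training:
        i = min(range(len(codebook)), key=lambda j: sq_dist(y, codebook[j]))
        cells[i].append(y)
        total += sq_dist(y, codebook[i])
    return cells, total / len(training)


def lbg_sa(training, codebook, t0, p, max_iter, eps, seed=0):
    """LBG with noise added to the code vectors after each Lloyd step."""
    rng = random.Random(seed)
    previous = None
    distortion = None
    for m in range(max_iter):
        # Step 2: Lloyd iteration (empty partitions keep their code vector).
        cells, _ = partition_and_distortion(codebook, training)
        codebook = [[sum(col) / len(cell) for col in zip(*cell)] if cell else list(c)
                    for c, cell in zip(codebook, cells)]
        # Step 3: perturbation; hypothetical schedule for the noise variance.
        sigma = (t0 * (1.0 - m / max_iter) ** p) ** 0.5
        codebook = [[v + rng.gauss(0.0, sigma) for v in c] for c in codebook]
        # Step 4: global average distortion of the perturbed codebook.
        _, distortion = partition_and_distortion(codebook, training)
        # Step 5: stop when the relative change falls below the threshold.
        if previous is not None and (
                distortion == 0.0 or abs(previous - distortion) / distortion < eps):
            break
        previous = distortion
    return codebook, distortion
```

With t0 = 0 the noise vanishes and the loop reduces to the plain Lloyd iteration, which is a convenient sanity check.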

In this project the "temperature” was defined as: 

= (9) 

Simulated annealing is a rather complex and very time-consuming process. A
novel approach to the non-convex optimization of VQ, deterministic annealing
(DA), appears to capture the benefits of SA without adding any randomness to
the design process. It is conceptually similar to the technique of fuzzy
clustering. In DA the statistical description of randomness is incorporated
into the cost function.



3 Simulation Results 

The lena and baboon images were used in the testing procedure. The first
results, describing the influence of the codebook size (CB) on the PSNR and
the bit rate (BR) for different vector dimensions (block sizes), are presented
in table 1.



Table 1. PSNR and BR versus block sizes for different codebook sizes: DCT-VQ, lena

Codebook    PSNR [dB] / BR [bpp]
size        2x2              4x4              8x8
64          32.56 / 1.400    26.53 / 0.375    19.76 / 0.094
128         34.73 / 1.748    28.18 / 0.437    21.05 / 0.109
256         36.81 / 2.000    29.67 / 0.500    22.19 / 0.125



The influence of the vector size on the PSNR value (and other parameters) for
a standard codebook size (CB=256) is presented in table 2. The table does not
include the size 16x16 because it was not coded in a direct manner.



Table 2. Coding parameters versus block sizes: DCT-VQ, CB=256, lena

Block    BR       CR    MSE       RMSE     NMSE     SNR      PSNR
size     [bpp]                             [%]      [dB]     [dB]
2x2      2.0      4     13.55     3.68     0.36     24.41    36.81
4x4      0.5      16    70.02     8.36     1.87     17.23    29.67
8x8      0.125    64    392.54    19.81    10.50    9.78     22.19
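
The bit rate, compression ratio and PSNR columns of the tables are consistent with the usual definitions for fixed-rate VQ of 8 bpp images: log2 of the codebook size divided by the number of pixels per block, the source resolution divided by the bit rate, and 10 log10(255^2 / MSE). The sketch below states these formulas explicitly; that the tables were computed exactly this way is an assumption, and the function names are chosen here.

```python
import math


def bit_rate(codebook_size, block_pixels):
    """Bits per pixel when one codebook index is sent per block."""
    return math.log2(codebook_size) / block_pixels


def compression_ratio(bits_per_pixel, source_bpp=8):
    """Ratio of the source resolution (8 bpp here) to the coded bit rate."""
    return source_bpp / bits_per_pixel


def psnr(mse, peak=255):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak * peak / mse)
```

For instance, a codebook of 256 vectors on 4x4 blocks gives 8/16 = 0.5 bpp and a compression ratio of 16, matching the 4x4 row of the table.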



Similar results were achieved for baboon (table 3).

Table 3. DCT-VQ, lena, baboon

Block    BR       lena         baboon
size     [bpp]    PSNR [dB]    PSNR [dB]
2x2      2.0      36.81        34.87
4x4      0.5      29.67        27.83
8x8      0.125    22.19        20.34



The block of size 16x16 is definitely too large for direct vector
quantization. Therefore a procedure of "zig-zag" scanning and partitioning of
the 1D block is recommended. In this project the entire block was partitioned
into 16 vectors of different sizes (16 points on average) resulting from the
implementation of the bit allocation procedure at the vector level. That case
(16 vectors) gave an average bit rate of 0.5 bpp, the same as in the case of
4x4 blocks with direct VQ described above. The simulations have shown that the
coding quality of the 1D-VQ system (16x16) is better than in the 2D case of
4x4. As an explanation, it is enough to recall that for larger data blocks a
deeper data mixing effect (data decorrelation) appears in the transformation
process. The vector quantization "has less to do". A similar experiment was
done for the block size 8x8. Each block was partitioned into 16 vectors of
4 points on average. The achieved bit rate was 2 bpp. The simulation results
were very promising too. Table 4 includes the final results.



Table 4. A comp, of DCT-VQ-2D to DCT-VQ-ID, lena 



Block 

size 


BR 

[bpp] 


CR 


MSB 


RMSE 


NMSE 

[%] 


SNR 

[dB] 


PSNR 

[dB] 


2x2 


2.0 


4 


13.55 


3.68 


0.36 


24.41 


36.81 


4x4 


0.5 


16 


70.02 


8.36 


1.87 


17.23 


29.67 


8x8 


0.125 


64 


392.54 


19.81 


10.50 


9.78 


22.19 


8x8 (ID) 


2.0 


4 


11.33 


3.36 


0.30 


25.18 


37.59 


16x16 (ID) 


0.5 


16 


34.28 


5.85 


0.91 


20.37 


32.78 



Blocking effects are a very important problem that is difficult to overcome
or eliminate. Even high resolution digital images are blocky when viewed at a
large scale. The VQ image is, however, much worse. This can be partly seen in
the set of pictures shown in fig. 3. The blockiness and the "sawtooth" effects
along diagonal edges are apparent. They are especially noticeable in the case
of large block sizes: 8x8 and 16x16. A very simple filtering process
(smoothing by averaging pixel values) around the block edges can improve the
quality sufficiently. This was also introduced into the project, but it did
not influence the value of the PSNR and the results were examined in a
subjective manner.

The original image and the ones obtained after vector quantization for
different vector (block) sizes and a codebook of 64 code vectors (CB=64) are
presented in fig. 3.
4 Conclusion

An algorithm of vector quantization of images in the Discrete Cosine Transform
domain was presented. The two-dimensional array of DCT coefficients was
"zig-zag" scanned before the vector quantization. Large data blocks were used
in order to achieve a better VQ performance. The encoding complexity of DCTVQ
was reduced drastically by dividing the 1D block into small vectors (product
code with bit allocation). In order to avoid local minima, simulated annealing
was implemented in the codebook design procedure. The achieved performance
gain was about 1 dB in SNR.

The majority of papers in the area of DCT VQ of images were published
at the ICASSP conference and the PCS symposium organized in Japan. The main
representatives of them are papers [2] and [3]. Work [2] is devoted to the
independent vector quantization of the RGB components of color images. It
implements a decomposition of a constant size block (8x8) into 15 diagonally
taken 1D vectors and an adaptive bit allocation. Such a system is much more
complex but is suitable for a direct comparison with the one proposed above.
In general the PSNR values are equal (within +/-1 dB). But the original
1D system (16x16) proposed above gives results about 2 dB better.



References 

1. Abdelwahab, A.A., Kwatra, S.C.: Image Data Compression with Vector
Quantization in the Transform Domain. IEEE Int. Conf. on Commun. ICC'86,
Toronto, 1986

2. Aizawa, K., Harashima, H., Miyakawa, H.: Adaptive Discrete Cosine Transform
Coding with Vector Quantization for Color Images. Proc. ICASSP'86, Tokyo,
Japan, 1986

3. Aizawa, K., Harashima, H., Miyakawa, H.: Adaptive Discrete Cosine Transform
Coding with Vector Quantization. PCS'86 Picture Coding Symp., Tokyo, Japan, 1986

4. Bellifemine, F., Picco, R.: 2D-DCT coding with Pyramidal Vector Quantization.
Picture Coding Symp., Torino, Italy, 1988

5. Cho, N.I., Lee, S.U.: A fast 4x4 DCT for the recursive 2-D DCT. IEEE Trans.
Sign. Processing, vol. 40, Sept. 1992

6. Clarke, R.J.: Digital Compression of Still Images and Video. Academic Press, 1996

7. Flanagan, J.K., Morrell, D.R., Frost, R.L., Read, C.J., Nelson, B.E.: Vector
Quantization Codebook Generation Using Simulated Annealing. ICASSP, Glasgow,
Scotland, May 1989

8. Gersho, A., Gray, R.M.: Vector Quantization and Signal Compression. Kluwer
Academic Publishers, 1992

9. Gotze, M.: Adaptive Vector Quantization of Images in the Discrete Cosine
Transform Domain. Picture Coding Symp. PCS'86, Tokyo, Japan, 1986

10. Marescq, J.P., Labit, C.: Vector Quantization in Transformed Image Coding. Int.
Conf. on Acoust. Speech and Sign. Proc., ICASSP'86, Tokyo, Japan, 1986

11. Rabbani, M., Jones, P.W.: Digital Image Compression Techniques. SPIE Optical
Engineering Press, 1991

12. Rak, R.J.: Signal Compression based on Fourier Transform Vector Quantization.
Mediterranean Electrotechnical Conf. MELECON'94, Antalya, Turkey, 1994

13. Rak, R.J.: A System For Transform Vector Coding of Images. 3rd International
Conference on Signal Processing ICSP'96, Beijing, China, 1996

14. Rak, R.J.: Wavelet Transform Vector Quantization of Images. 13th International
Conference on Signal Processing DSP97, Santorini, Greece, 1997

15. Rao, K.R., Yip, P.: Discrete Cosine Transform. Academic Press, 1990

16. Saito, T., Takeo, H., Aizawa, K., Harashima, H., Miyakawa, H.: Discrete Cosine
Transform Coding System Using Gain/Shape Vector Quantizers and its application
to Image Coding. Picture Coding Symposium, PCS'86, Tokyo, Japan, 1986








Fig. 3. The original image and the ones obtained after vector quantization for
different vector (block) sizes. Top-left: original image; top-right: recovered
for the size 2x2 (BR=1.5 bpp, PSNR=32.56 dB); bottom-left: recovered for the
size 4x4 (BR=0.375 bpp, PSNR=24.96 dB); bottom-right: recovered for the size
8x8 (BR=0.09375 bpp, PSNR=19.76 dB)
[ Abstract.] The wavelet compression method is one of the
most effective techniques of digital image compression. The
efficiency of this method strongly depends on the filters used in
the two-dimensional wavelet transform. The fundamental method of
constructing finite impulse response filters was given by I.
Daubechies. The paper presents a new proof of the fact that the
Daubechies filters satisfy the power complementary condition,
which is one of the conditions for perfect image reconstruction.



In [Dau1], [Dau2] I. Daubechies proposed a method of designing filters
for the two-dimensional wavelet transform, which is used in image compression
[Mai], [Str], [Rak]. Such filters must satisfy the power complementary or
Smith-Barnwell condition [Vai], [Str]:

    $|H_0(\omega)|^2 + |H_0(\omega + \pi)|^2 = 2,$    (1)

where $H_0(\omega)$ is the Fourier transform of the impulse response of the filter, i.e.

    $H_0(\omega) = \sum_n h_n e^{-in\omega}$    (2)

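In this paper we give a new proof of the fact that the polynomial

    $|H_0(\omega)|^2 = 2 \left( \frac{1 + \cos\omega}{2} \right)^p \sum_{n=0}^{p-1} \binom{p+n-1}{n} \left( \frac{1 - \cos\omega}{2} \right)^n$    (3)

is a solution of the equation (1).

To show that (3) indeed satisfies (1) let us introduce a new variable x:

    $x = \frac{1 - \cos\omega}{2}$    (4)

It is easy to see that $x \in [0, 1]$. Thus it is enough to prove the following theorem.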
[ Theorem 1.] For $x \in [0, 1]$ and $p \ge 1$ we have

    $(1-x)^p \sum_{n=0}^{p-1} \binom{p+n-1}{n} x^n + x^p \sum_{n=0}^{p-1} \binom{p+n-1}{n} (1-x)^n = 1.$    (5)



[ Proof.] Let us first observe that (5) holds trivially for x = 0 and x = 1. So let
us assume that $x \in (0, 1)$. For such x the formula (5) is equivalent to

    $\frac{1}{x^p (1-x)^p} = \sum_{n=0}^{p-1} \binom{p+n-1}{n} \left( \frac{1}{x^{p-n}} + \frac{1}{(1-x)^{p-n}} \right)$    (6)

Before we show (6) let us first prove a simpler statement.



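[ Lemma 1.] For $x \in (0, 1)$ and $p \ge 1$ it holds

    $\frac{1}{x^p (1-x)} = \frac{1}{x^p} + \frac{1}{x^{p-1}} + \cdots + \frac{1}{x} + \frac{1}{1-x}$

[ Proof.] The formula clearly holds for p = 1. Assume that it holds for some
$p \ge 1$. Then

    $\frac{1}{x^{p+1}(1-x)} = \frac{1}{x^{p+1}} + \frac{1}{x^p} + \cdots + \frac{1}{x^2} + \frac{1}{x(1-x)}$

    $= \frac{1}{x^{p+1}} + \frac{1}{x^p} + \cdots + \frac{1}{x^2} + \frac{1}{x} + \frac{1}{1-x}.$

Thus, by induction, it holds for all $p \ge 1$.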



Switching x and 1 - x in Lemma 1 we get

    $\frac{1}{(1-x)^p x} = \frac{1}{(1-x)^p} + \frac{1}{(1-x)^{p-1}} + \cdots + \frac{1}{1-x} + \frac{1}{x}$    (7)



The proof of the following identity can be found in [GKP]. It holds for $n \ge 0$:

    $\sum_{k=0}^{n} \binom{m+k}{k} = \binom{m+n+1}{n}$    (8)

Now we are going to show (6). It holds for p = 1, so let us assume that (6) holds
for some $p \ge 1$. We shall show that it holds for p + 1. From the assumption,
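using Lemma 1 and (7), we get

    $\frac{1}{x^{p+1}(1-x)^{p+1}} = \sum_{n=0}^{p-1} \binom{p+n-1}{n} \left( \frac{1}{x^{p+1-n}(1-x)} + \frac{1}{(1-x)^{p+1-n} x} \right)$

    $= \sum_{n=0}^{p-1} \binom{p+n-1}{n} \left[ \sum_{j=1}^{p+1-n} \left( \frac{1}{x^j} + \frac{1}{(1-x)^j} \right) + \left( \frac{1}{x} + \frac{1}{1-x} \right) \right]$

    $= 2 \sum_{n=0}^{p-1} \binom{p+n-1}{n} \left( \frac{1}{x} + \frac{1}{1-x} \right) + \sum_{k=0}^{p-1} \left( \sum_{n=0}^{k} \binom{p+n-1}{n} \right) \left( \frac{1}{x^{p+1-k}} + \frac{1}{(1-x)^{p+1-k}} \right)$

Now we can use (8) and the simple identity

    $2 \binom{2p-1}{p-1} = \binom{2p}{p}$

to get

    $\frac{1}{x^{p+1}(1-x)^{p+1}} = \binom{2p}{p} \left( \frac{1}{x} + \frac{1}{1-x} \right) + \sum_{k=0}^{p-1} \binom{p+k}{k} \left( \frac{1}{x^{p+1-k}} + \frac{1}{(1-x)^{p+1-k}} \right)$

    $= \sum_{k=0}^{p} \binom{p+k}{k} \left( \frac{1}{x^{p+1-k}} + \frac{1}{(1-x)^{p+1-k}} \right).$

Thus (6) holds for p + 1 and, by induction, for every $p \ge 1$.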
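
The identity of Theorem 1 can also be checked with exact rational arithmetic for small p. The sketch below evaluates the left-hand side of (5), written as (1-x)^p P(x) + x^p P(1-x) with P(x) the sum of C(p+n-1, n) x^n over n = 0, ..., p-1; the function names are chosen here for illustration, and such a check is only a sanity test, not a substitute for the proof.

```python
from fractions import Fraction
from math import comb


def P(p, x):
    """P(x) = sum_{n=0}^{p-1} C(p+n-1, n) x^n."""
    return sum(comb(p + n - 1, n) * x ** n for n in range(p))


def identity_lhs(p, x):
    """Left-hand side of (5): (1-x)^p P(x) + x^p P(1-x)."""
    return (1 - x) ** p * P(p, x) + x ** p * P(p, 1 - x)
```

Using `Fraction` arguments avoids rounding, so the comparison with 1 is exact for every tested pair (p, x).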
[ Abstract.] In this paper the discovery of default knowledge as
proposed by Mollestad [7], [8], [9], [10] is further investigated.
Mollestad's algorithm, as described in [9], is refined and
extended in several ways. In particular, new heuristics guiding
the search for default decision rules are proposed and evaluated.
The results so far have been encouraging when the (modified)
framework is compared to other rough set methods.



1 Introduction 

Knowledge Discovery in Databases (KDD) [2], [15] is motivated by the need for
automated, efficient, and intelligent methods for data analysis (summarization,
clustering, classification etc.). The ultimate goal of KDD is to discover knowledge
in the form of useful patterns from raw data.

Rough set theory has been used in KDD and machine learning [21] to find
decision rules from data tables [14], [19], [20]. In an abstracted form, these rules
are written as A → B and interpreted as "A implies B", or equivalently "if
A then B". Such rules are appealing as they closely resemble the way humans
represent knowledge in everyday parlance.

In [7], [8], [9], [10], Mollestad et al. proposed to synthesize default decision
rules, i.e. rules that have exceptions, by systematically removing attributes from
the given data. The motivation for inducing rules in this way is twofold. First,
data is in general uncertain (due to noise and errors) and inconsistent. This
implies that even correct rules are in general approximate. Second, by removing
information in the induction, the resulting rules describe more general trends in
the data, and, equally important, these rules are particularly suitable for
classification of new objects with missing values (situations with limited knowledge).

Mollestad's algorithm is, to the best of our knowledge, the first approach
to automatic synthesis of large knowledge bases for default reasoning. We
propose heuristics for an improved version of his algorithm and test the extended
framework on real data. The quality of our classifiers seems to be better than
that of classifiers obtained by other rough set methods, and though the costs
remain high, the heuristics have greatly contributed to making the approach
feasible. The reported work is in progress, and further validation is required,
both for the framework in general and in particular for the heuristics.

In the following section, ideas and terminology from default reasoning are
reviewed. Section 3 recapitulates rule generation in the rough set approach, and
Mollestad's algorithm is briefly recalled in Sect. 4. Heuristics are proposed and
discussed in Sect. 5. Results are presented and commented on in Sect. 6, and we
give preliminary conclusions based on our results in the last section.



2 Default Reasoning 

Very informally, default reasoning is about “jumping to conclusions” on the 
basis of limited information, and is omnipresent in common sense reasoning with 
inconclusive evidence. Default reasoning is a case of nonmonotonic reasoning: 
The set of consequences does not necessarily increase as new axioms are added, 
and the addition may invalidate previous consequences. 

Several formal approaches to default reasoning have been proposed. The Default
Logic (DL) of Reiter [17] (also [1], [6]) formalizes default reasoning by
the use of default rules. A DL-theory consists of two disjoint sets of formulae,
the defaults and the facts (axioms). The facts are assumed to be logically valid
pieces of knowledge, whereas the defaults describe relations between facts which
in general have exceptions. Reiter's approach has two disadvantages: 1) It is
necessary to know all exceptions to a default rule a priori, and 2) In the case
of conflicting defaults, DL offers no means for deciding which is correct (cf. the
multiple extensions problem).

An alternative formalization is offered by the Preferred Subtheories of Brewka
[1]. His ideas generalize the Theorist system of Poole [16]. Also in Theorist, the
formulae are divided into a set of facts, $F$, and a set of defaults, $\Delta$.
Facts are assumed to be irrefutable while the defaults may be counter-proven.
Brewka proposed two generalizations to this two-level prioritization. The first
is to partition all formulae $T$ (corresponding to $F \cup \Delta$) into $k$
mutually disjoint sets $T_0, \ldots, T_{k-1}$, where a formula in $T_i$ is
considered more reliable than a formula in any $T_j$ when $j > i$. This approach
has two drawbacks: 1) Any formula must be judged against all others, and 2) There
may still be conflicts between two formulae (rules) with the same “reliability”.
The other generalization amends this by defining a partial order between all and
only those rules that may conflict.

3 Rule Generation in the Rough Set Approach 

Rough set (RS) theory was introduced by Pawlak [12], [13] as a formal tool for
approximate reasoning. The basic construct is an information system defined as
a pair $\mathcal{A} = (U, A)$, where $U$ is a finite, nonempty set of objects,
and $A$ is a finite, nonempty set of attributes. Each $a \in A$ is a total
function $a : U \to V_a$, where $V_a$ is the value set of $a$. A decision table
(DT) is an information system where the set of attributes $A$ is divided into two
disjoint, nonempty subsets $C$ (conditions) and $D$ (decisions). Henceforth, $D$
will be assumed to be a singleton, i.e. $D = \{d\}$.

For a DT $\mathcal{A} = (U, C \cup \{d\})$, the indiscernibility relation with
respect to $B \subseteq C$ in $\mathcal{A}$, $IND_{\mathcal{A}}(B)$, is defined as
$IND_{\mathcal{A}}(B) = \{(x, y) \in U^2 \mid \forall a \in B,\ a(x) = a(y)\}$.
The equivalence class of $x \in U$ in $IND_{\mathcal{A}}(B)$ is denoted $[x]_B$.
A reduct in $\mathcal{A}$ is a minimal (with respect to inclusion) set of
attributes $R \subseteq C$ such that $IND_{\mathcal{A}}(R) = IND_{\mathcal{A}}(C)$.
The set of all reducts of $\mathcal{A}$ is denoted $RED_{\mathcal{A}}$.

The discernibility matrix of $\mathcal{A} = (U, C \cup \{d\})$ is defined as the
$n \times n$ matrix $M_{\mathcal{A}} = (m_{ij})$, where $n = |U|$ and
$m_{ij} = \{a \in C \mid a(x_i) \neq a(x_j)\}$. The discernibility matrix modulo
decision of $\mathcal{A}$, $M^d_{\mathcal{A}} = (m^d_{ij})$, is defined by
$m^d_{ij} = \{a \in C \mid a(x_i) \neq a(x_j)\}$ if $d(x_i) \neq d(x_j)$, and
$m^d_{ij} = \emptyset$ otherwise.

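Given a DT $\mathcal{A}$, an expression of the form $a = v$, where
$a \in (C \cup \{d\})$ and $v \in V_a$, is called a descriptor in $\mathcal{A}$.
Let $\mathcal{C}_{\mathcal{A}}$ denote the set
$\{\tau \mid \tau = (c_1 = v_1 \wedge \cdots \wedge c_k = v_k)\}$, where $k > 0$,
$c_i \in C$, $v_i \in V_{c_i}$, $i = 1, \ldots, k$, and $c_i \neq c_j$ for
$i \neq j$. Let $\mathcal{D}_{\mathcal{A}}$ denote the set
$\{\tau' \mid \tau' = (d = v_1 \vee \cdots \vee d = v_l)\}$, where $l \geq 1$ and
$v_1, \ldots, v_l \in V_d$. The decision rules of interest are then given by the
set $RULES_{\mathcal{A}} = \{\tau \to \tau' \mid \tau \in \mathcal{C}_{\mathcal{A}},\ \tau' \in \mathcal{D}_{\mathcal{A}}\}$.

For $\tau \in \mathcal{C}_{\mathcal{A}} \cup \mathcal{D}_{\mathcal{A}}$, $[\tau]$
denotes those $x \in U$ satisfying the conditions stated in $\tau$, i.e. if
$\tau$ is a descriptor $a = v$, $[\tau] = \{x \in U \mid a(x) = v\}$; if
$\tau = \tau_1 \wedge \tau_2$, $[\tau] = [\tau_1] \cap [\tau_2]$; if
$\tau = \tau_1 \vee \tau_2$, $[\tau] = [\tau_1] \cup [\tau_2]$; and, if
$\tau = \neg\tau_1$, $[\tau] = U - [\tau_1]$. For a rule
$r = (\tau \to \tau') \in RULES_{\mathcal{A}}$, where
$\tau' = (d = v_1 \vee \cdots \vee d = v_l)$, the probability of $r$ in
$\mathcal{A}$, $\mu_{\mathcal{A}}(r)$, is defined as
$\mu_{\mathcal{A}}(r) = \max_i \{\, |[d = v_i] \cap [\tau]| \,/\, |[\tau]| \,\}$.
The support of $r$ in $\mathcal{A}$, $\sigma_{\mathcal{A}}(r)$, is defined as
$\sigma_{\mathcal{A}}(r) = |[\tau]|$.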
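The definitions of this section can be made concrete with a small sketch. The toy decision table, the attribute names, and the function names below are all hypothetical and are not taken from the paper or from any toolkit. The rule probability is computed as a maximum over the listed decision values, which is only our reading of the definition above.

```python
from itertools import combinations

# Hypothetical toy decision table: attributes "a", "b" are conditions, "d" is the decision.
TABLE = [
    {"a": 1, "b": 0, "d": "yes"},
    {"a": 1, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "no"},
    {"a": 0, "b": 0, "d": "no"},
]
CONDITIONS = ("a", "b")


def equivalence_classes(table, attrs):
    """Classes of the indiscernibility relation IND(attrs), as sets of object indices."""
    classes = {}
    for i, obj in enumerate(table):
        classes.setdefault(tuple(obj[a] for a in attrs), set()).add(i)
    return list(classes.values())


def discernibility_modulo_decision(table, conditions, decision="d"):
    """Entries m^d_ij for i < j: discerning attributes if decisions differ, else empty."""
    entries = {}
    for i, j in combinations(range(len(table)), 2):
        if table[i][decision] != table[j][decision]:
            entries[(i, j)] = frozenset(
                a for a in conditions if table[i][a] != table[j][a]
            )
        else:
            entries[(i, j)] = frozenset()
    return entries


def matching(table, descriptors):
    """Objects satisfying a conjunction of descriptors given as {attribute: value}."""
    return {
        i for i, obj in enumerate(table)
        if all(obj[a] == v for a, v in descriptors.items())
    }


def rule_stats(table, antecedent, decision_values, decision="d"):
    """Return (support, probability) of the rule antecedent -> (d = v_1 or ... or d = v_l)."""
    covered = matching(table, antecedent)
    if not covered:
        return 0, 0.0
    # Assumption: probability is the best single decision value's share of the covered objects.
    best = max(len(covered & matching(table, {decision: v})) for v in decision_values)
    return len(covered), best / len(covered)
```

On this table, objects 0 and 1 are indiscernible but carry different decisions, so no definite rule can cover them. This is the kind of inconsistency that default rules are meant to tolerate.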

4 Default Rules 

This section reviews Mollestad's framework for generation of default rules (as
presented in [9]). For a DT $\mathcal{A}$, a rule $r \in RULES_{\mathcal{A}}$ is
said to be definite if $\mu_{\mathcal{A}}(r) = 1$ (aka true in $\mathcal{A}$).
In [9], a default rule is defined as a rule with
$\mu_{\mathcal{A}}(r) \geq \mu_{tr}$, where $0 < \mu_{tr} \leq 1$. In this paper,
the term default rule will mostly be used for rules with probability strictly
less than 1.

Mollestad's algorithm may in loose terms be described as a search for default
rules over the power set of the condition attributes. A DT
$\mathcal{A}' = (U, C' \cup \{d\})$ obtained from a DT
$\mathcal{A} = (U, C \cup \{d\})$ by deleting a (nonempty) set $C - C'$ is
called a variant of $\mathcal{A}$. Mollestad proposed to create new variants by
removing attributes so that indeterminacies are introduced, and subsequently
generate rules from all resulting variants. A (new) indeterminacy is introduced
by removing attributes discerning objects with different decisions. To formalize,
let $\Phi_{\mathcal{A}} = \{m^d_{ij} \mid m^d_{ij} \neq \emptyset\}$, i.e. the
set of nonempty discernibility factors modulo decision. Furthermore, let
$\Phi^*_{\mathcal{A}} = \{f \in \Phi_{\mathcal{A}} \mid \neg\exists g \in \Phi_{\mathcal{A}},\ g \subset f\}$.
Then a constructed variant is defined in the following inductive way: i) (Base
case) $\mathcal{A}$ is constructed, ii) (Induction step) if
$\mathcal{A}' = (U, C' \cup \{d\})$ is a constructed variant, then the DT
$\mathcal{A}'' = (U, C'' \cup \{d\})$ is constructed if and only if
$C' - C'' \in \Phi^*_{\mathcal{A}'}$.
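The inductive definition of constructed variants can be sketched as a simple graph search. This is a minimal illustration under our own assumptions, not the implementation used in the paper: the function names and the toy table are hypothetical, and variants with an empty condition set are skipped because a decision table requires a nonempty set of conditions.

```python
from itertools import combinations

# Hypothetical toy decision table with conditions "a", "b", "c" and decision "d".
TABLE = [
    {"a": 1, "b": 0, "c": 0, "d": "yes"},
    {"a": 0, "b": 0, "c": 0, "d": "no"},
    {"a": 1, "b": 1, "c": 1, "d": "no"},
]


def factors(table, conditions, decision="d"):
    """Nonempty discernibility factors modulo decision for the given condition set."""
    found = set()
    for x, y in combinations(table, 2):
        if x[decision] != y[decision]:
            f = frozenset(a for a in conditions if x[a] != y[a])
            if f:
                found.add(f)
    return found


def minimal_factors(all_factors):
    """Factors that are not proper supersets of any other factor."""
    return {f for f in all_factors if not any(g < f for g in all_factors)}


def constructed_variants(table, conditions, decision="d"):
    """Condition sets of all constructed variants (may be exponential in |conditions|)."""
    start = frozenset(conditions)
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        # Induction step: remove one minimal factor of the current variant.
        for f in minimal_factors(factors(table, current, decision)):
            smaller = current - f
            if smaller and smaller not in seen:
                seen.add(smaller)
                stack.append(smaller)
    return seen
```

For the toy table the factors are {a} and {b, c}, so the constructed variants keep the condition sets {a, b, c}, {b, c}, and {a}.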



5 Heuristics 

Even the set of constructed variants may have size exponential in the number
of condition attributes. This forces heuristics to be employed in order to further
restrict the set of variants from which to generate rules. The following heuristics
can be divided into two categories: 1) Restrictions on the set of discernibility
factors used when generating new variants, and 2) Performance-based heuristics
combined with thresholds for cutting search.

5.1 Factor Filtering Heuristics

The notion of a constructed variant is based on the set of discernibility factors
that are not supersets of any other factor. This implies that at each step a
“minimal” number of attributes are removed. To find the most general rules
faster, and also to find a larger number of new rules in each new variant, it is
better to remove larger sets of attributes. In this respect, using the “longest”
factors in the set $\Phi_{\mathcal{A}}$ seems a better option. To this end, we
propose the following heuristics for filtering the set $\Phi_{\mathcal{A}}$:
1) Use only factors that are longer than average¹, 2) Use the
$c \times |\Phi_{\mathcal{A}}|$ longest factors ($c \in (0, 1]$ is a parameter),
and 3) Use the $N$ longest factors (parameter $N$).

Basing filtering criteria on factor length is a subcase of a more general
strategy employing attribute costs (length corresponds to uniform costs). As a
refinement to the above heuristics we propose to assign costs to each attribute and
calculate costs for each factor as the cost of the attribute sets of the resulting
variant. The cost of an attribute may reflect resource requirements with respect
to acquiring information on an attribute or represent an inverted measure of the
attribute's information content. Strategies 1) to 3) may thus be rephrased in
terms of least cost factors.
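The three length-based filters can be sketched as follows. The function names are hypothetical, and rounding the fraction-based count upwards is our own assumption, since the text does not say how a non-integer count is handled.

```python
import math


def longer_than_average(factors):
    """Strategy 1: keep factors strictly longer than the mean factor length."""
    if not factors:
        return []
    mean = sum(len(f) for f in factors) / len(factors)
    return [f for f in factors if len(f) > mean]


def longest_fraction(factors, c):
    """Strategy 2: keep the c * |factors| longest factors, with c in (0, 1]."""
    if not 0 < c <= 1:
        raise ValueError("c must lie in (0, 1]")
    count = math.ceil(c * len(factors))  # assumption: round up
    return sorted(factors, key=len, reverse=True)[:count]


def longest_n(factors, n):
    """Strategy 3: keep the n longest factors, regardless of how many exist."""
    return sorted(factors, key=len, reverse=True)[:n]
```

A cost-based variant would replace `len` with a function summing hypothetical per-attribute costs and would sort in ascending order of cost.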



5.2 Performance-Based Heuristics 

These heuristics are based on the assumption that removing attributes from a
variant with high “performance” along various dimensions results in new variants
which also perform well. The purpose of the discovery process may be to find
strong individual patterns, to find compact, yet accurate models of the data, or
to find classifiers for prediction of new objects.

Assume a DT $\mathcal{A} = (U, C \cup \{d\})$ is given, and that
$\mathcal{A}' = (U, C' \cup \{d\})$ is a variant of $\mathcal{A}$. In [9], the
following two measures of variant interestingness are proposed:
$glued(\mathcal{A}') = |\{x \in U \mid [x]_C \subset [x]_{C'}\}|$ and
$kept(\mathcal{A}') = |\{x \in U \mid d(x) = d([x]_{C'})\}|$, where
$d([x]_{C'})$ is the decision supported by the most objects in $[x]_{C'}$.
Larger equivalence classes result in more general rules (high value of glued),
and variants with a high kept-value result in rules with high probabilities.
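The two measures can be computed directly from the equivalence classes before and after attribute removal. The sketch below is a minimal illustration with hypothetical names and a hypothetical toy table. Ties for the majority decision are broken by the lowest object index, which is an assumption the text does not address.

```python
from collections import Counter

# Hypothetical toy decision table: conditions "a", "b"; decision "d".
TABLE = [
    {"a": 1, "b": 0, "d": "yes"},
    {"a": 1, "b": 0, "d": "no"},
    {"a": 0, "b": 1, "d": "no"},
    {"a": 0, "b": 0, "d": "no"},
]


def classes_by_object(table, attrs):
    """Map each object index to its equivalence class under the given attributes."""
    groups = {}
    for i, obj in enumerate(table):
        groups.setdefault(tuple(obj[a] for a in attrs), []).append(i)
    return {i: frozenset(members) for members in groups.values() for i in members}


def glued(table, conditions, variant_conditions):
    """Number of objects whose equivalence class strictly grows in the variant."""
    before = classes_by_object(table, conditions)
    after = classes_by_object(table, variant_conditions)
    return sum(1 for i in range(len(table)) if before[i] < after[i])


def kept(table, variant_conditions, decision="d"):
    """Number of objects whose decision equals the majority decision of their class."""
    after = classes_by_object(table, variant_conditions)
    count = 0
    for i, obj in enumerate(table):
        votes = Counter(table[j][decision] for j in sorted(after[i]))
        majority = votes.most_common(1)[0][0]
        count += obj[decision] == majority
    return count
```

Dropping attribute "b" merges objects 2 and 3 into one class, so two objects are glued, and three of the four objects agree with their class majority.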

If the purpose is to find classifiers or accurate models, the rule accuracy of
a variant (the variant's rule set) is an obvious heuristic. Another measure of
classificational power is found in the area under the Receiver Operating
Characteristic (ROC) curve [3]. If classifier performance is the only issue, it
seems a good idea to maximize performance on one (or both) of these two measures.
As a description of the given data, model transparency (comprehensibility) is
important. When smaller models are desired, the complexities of resulting sets
of rules and reducts may be suitable heuristics. Furthermore, measures of
interestingness of individual rules should be employed when the aim is to find the
strongest patterns.

The proposed measures of variant “quality” can be combined with thresholds
to decide if further search should be stopped, or together with a maximum,
N, for the total number of variants in order to guide search to the N most
promising variants based on performance on any of the proposed measuring
functions. These heuristics are readily employed in conjunction with the notion
of constructibility, but may also be used alone.

¹ This was initially proposed in [9].

6 Experiments and Results 

To give an indication of the value of synthesizing default rules, we compare
the rule sets found with our algorithm with models obtained by “standard”
RS methods. All experiments have been carried out with the Rosetta [18], [22]
toolkit wherein the framework for default rules generation has been implemented.
The available reduct calculation algorithms are: Dynamic, Exhaustive, Genetic,
and Johnson. In all experiments, object-related reducts were calculated, and
within the default rules framework the Exhaustive (exact reduct calculations)
algorithm was invoked for each variant.²

The following data sets, all from the UCI Repository of Machine Learning
and Domain Theories [11], were chosen for analysis: Australian Credit Card
Approval (AUS), Cleveland Heart Disease (CLE), Hepatitis (HEP), and Breast
Cancer (BCO) and Lymphography (LYM) from the Oncology Institute.³ The
test strategy was single train-test validation by randomly splitting the original
data set into two disjoint sets, training on one and validating against the other.
For domains with continuous attributes, scaling was carried out after splitting:
the training table was discretized first, and the test table was then scaled with
the same cuts. The best accuracies obtained with the default rules algorithm are
contrasted with other rough set methods in Tab. 1. Note that for the Dynamic and
Genetic algorithms all parameters were kept at their default values (cf. [18]).
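The scaling protocol described above (cuts learned on the training table only, then reused on the test table) can be sketched as follows. The paper does not state which discretization method was used, so the equal-frequency cuts below are purely an illustrative assumption, and both function names are hypothetical.

```python
from bisect import bisect_right


def equal_frequency_cuts(values, bins):
    """Cut points that split the sorted training values into roughly equal-sized bins."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // bins] for k in range(1, bins)]


def apply_cuts(values, cuts):
    """Replace each value by the index of the interval it falls into."""
    return [bisect_right(cuts, v) for v in values]


# Cuts come from the training column only; the test column reuses them unchanged.
train_column = [1, 2, 3, 4, 5, 6]
test_column = [0, 3, 4, 10]
cuts = equal_frequency_cuts(train_column, bins=3)
train_scaled = apply_cuts(train_column, cuts)
test_scaled = apply_cuts(test_column, cuts)
```

Deriving the cuts from the training table alone keeps information about the test objects out of the induced rules.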



Table 1. Comparison of default rules synthesis to other rough set methods.
Numbers in parentheses refer to the percentage of objects (of the total data set)
used for training

Method       AUS (20)   CLE (30)   HEP (30)   BCO (30)   LYM (40)
Default      .8714      .8374      .9074      .7600      .8315
Dynamic      .8315      .8079      .8241      .7300      .7416
Exhaustive   .8424      .7783      .8148      .7150      .7416
Genetic      .8406      .7734      .7963      .7150      .7640
Johnson      .7736      .7734      .7870      .6600      .6854


In order to assess the heuristics, train-test pairs from three of the domains 
were taken for further analysis using the factor filtering heuristics (uniform costs) 
with the default rules algorithm. Table 2 summarizes results from this part of 
the experiments. 

² Employing approximate reduct calculations may reduce time requirements, but the
current implementations of the Johnson and the Genetic algorithms had deficiencies
rendering this impossible.

³ For detailed information on these domains, the reader is referred to [11].

Table 2. Factor filtering heuristics. AVG refers to the strategy of using factors
longer than average, R (c) refers to using the $c \times |\Phi_{\mathcal{A}}|$
longest factors, and N (n) refers to using the n longest factors

                        Heuristic
Domain   None    AVG     R (.40)   R (.60)   N (100)   N (300)
AUS      .8714   .8659   .8659     .8659     .8587     .8659
CLE      .8374   .8177   .8177     .8227     .7783     .7980
HEP      .9074   .8704   .8704     .8704     .8704     .9074


6.1 Comments and Observations 

A caveat is required before commenting on the results: The accuracy of the best
variant's rule set cannot immediately be recognized as the performance of the
algorithm as a classifier inducer (in that case, it would be necessary to predict
this variant from the training table). However, what the results in Tab. 1 do
suggest is that valuable information is missed when rule generation is based on
the discernibility of all attributes, and hence, that default rules synthesis
actually is worthwhile. From all data sets, a large number of variants giving rule
accuracy better than the input tables were found. Considering model size, the
advantage of these variants is further strengthened, and even if the rule sets of
variants did not obtain better accuracy than the standard RS methods, they are
valuable in offering more compact and thus more comprehensible models.

Filtering the factor set significantly reduced the total number of variants
and in particular variants with many attributes. This resulted in a considerable
boost of efficiency, though, as indicated by Tab. 2, with some loss of information.
Removing all factors with fewer than $k$ attributes, for some $1 < k < |C|$, from
a DT $\mathcal{A} = (U, C \cup \{d\})$ effectively excludes all variants with more
than $|C| - k$ attributes. If the optimal, in some sense, variant has more than
this number of attributes, a filtering strategy removing all factors with fewer
than $k$ attributes is bound to miss this variant. Finding the correct value of
this $k$ a priori is clearly not trivial, and in using any filtering strategy this
aspect should be kept in mind.

Comparing the obtained results with previously reported studies is difficult.
Actual observed rule accuracy is a product of a number of steps each having
several options, and without detailed knowledge of the constituent steps in other
studies, we refrain from making any concrete comparisons. In the experiments
we used very small training tables compared to, for instance, what would be
the case in a 10-fold cross-validation setting, and though our results are based
on a single partitioning split, this fact suggests that the reported accuracies
are fairly conservative estimates. Even so, the obtained accuracies, also for the
heuristics, compare quite well with numbers reported elsewhere (cf. [4], [9], [11]),
which indicates that defaults generated from a small number of training examples
generalize well to new data. It is interesting to observe that the default rules
perform particularly well on the HEP and LYM data sets, which both have “many”
attributes. A plausible, albeit ad hoc, explanation for this is that there may be
several irrelevant attributes which do not contribute to prediction but confuse
the standard RS approaches.

7 Conclusions

The heuristics significantly reduced time requirements and gave encouraging
results with respect to performance of the best variant. Among the factor-filtering
heuristics, the strategy of using the $c \times |\Phi_{\mathcal{A}}|$ longest
factors seems more promising than the other two. This strategy is more flexible
than considering only factors longer than average, and though the N-longest
strategy also can be adjusted, it is absolute in that it does not consider the
actual number of factors. Filtering on attribute costs has not been tested yet,
but is an aspect that should be given high priority in future experiments. The
performance-measuring heuristics have not been assessed in these experiments.
In [5], promising results were obtained combining performance heuristics with a
maximum number of variants in a bottom-up search.

For all practical purposes the collection of all synthesized rule sets is unwieldy, 
and there is a high degree of overlap in that several rules are repeated. We pro- 
pose the following three ways to utilize the result of default rules generation: 
1) Extract a specified number of the most interesting rules, defined by perfor- 
mance on some rule-interestingness measure, 2) Collect the union of all rules 
and construct a prioritized knowledge base in which a partial order among rules 
is defined by ordering on rule-specificity (cf. partially ordered Preferred Subthe- 
ories, Brewka), and 3) Refine the algorithm by defining a predictor-function in 
order to predict a variant and return its rule set as the result of the algorithm
as a classifier inducer.

1. Discrete Random Variables

In this section, we will examine the classical random variables from the point of
view of belief functions. Let $(C, \Theta, Pr)$ be a probability space, where $C$
is the universe, $\Theta$ is a $\sigma$-algebra, and $Pr$ is a probability measure
[2]; we will be interested in the case where $C$ is a finite set.

1.1. Suppose $C$ is a finite set. A real-valued function $X : C \to R$ is a
measurable function if all its inverse images are measurable (this definition is
equivalent to that of [2] only when $C$ is finite). Let the set of distinct values
be $S = \{a_1, a_2, \ldots, a_n\}$. We will be interested in the inverse image
$X^{-1}(a_i)$ of $a_i$ under $X$; for convenience we will write

$X_i = X^{-1}(a_i) = \{c \mid c \in C \text{ and } X(c) = a_i\}$

By abuse of notation, we also use $X$ to denote the collection of inverse images

$X = \{X_i : i = 1, 2, \ldots, n\}$

So $X$ is a random variable as well as the collection of inverse images, which
forms a partition on $C$. The collection $X$ generates a $\sigma$-algebra
$\Theta(X) \subseteq \Theta$.

1.2. A new probability $P_X$ can thus be defined by setting
$P_X(X_i) = Pr(X_i)$, and extending it to $\Theta(X)$ additively, namely, for
$Y = \bigcup_j X_j$,

$P_X(Y) = \sum_j P_X(X_j)$, where $j$ varies through a finite subset of
$\{1, 2, \ldots, n\}$.

In particular, the total sum $P_X(C) = \sum_{i=1}^{n} P_X(X_i) = 1$. So
$(C, \Theta(X), P_X)$ is a new probability space.

1.3. The inner probability can be expressed by

$P_{X*}(A) = SUP\{\sum_j P_X(Y_j) \mid A \supseteq Y = \bigcup_j Y_j \text{ and } Y_j \in X\}$,

where SUP is the least upper bound.

1.4. So given a random variable, we have

(S1) a partition $X$ on $C$, and

(S2) a numerical value, $m(X_i) = P_X(X_i)$, $i = 1, 2, \ldots, n$, such that

(S3) $\sum_{i=1}^{n} m(X_i) = 1$,

and the inner X-probability is the belief function:

$Bel(A) = P_{X*}(A) = SUP\{\sum_j m(Y_j) \mid A \supseteq Y = \bigcup_j Y_j \text{ and } Y_j \in X\}$

It is clear that Shafer theory is a generalization of random variables, in which
(S1) is weakened to a partial covering, yet the theory computes belief functions
as if $X$ were disjoint. This observation spells out the precise point of our
motivation in re-interpreting Shafer theory [6], [7], [8]. The main goal of this
paper is to re-interpret the notion of focal elements: even though they appear to
have non-empty intersections, their basic probability assignments are, in fact,
"measures" of disjoint fuzzy sets. Focal elements are merely the supports of these
fuzzy sets.

1.5. Let $A$ be any subset of $C$. The lower and upper approximations are defined by

$L[A] = \bigcup\{Y_j \in X : Y_j \subseteq A\}$; $H[A] = \bigcup\{Y_j \in X : Y_j \cap A \neq \emptyset\}$

It is clear that $L[A] \in \Theta(X)$, and we have

Proposition. $Bel_X(A) = P_{X*}(A) = P_X(L[A])$
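For a partition, the proposition can be checked numerically with a short sketch. The universe, blocks, and masses below are hypothetical values chosen for illustration, and the function names are our own.

```python
from fractions import Fraction

# Hypothetical partition of C = {1, ..., 6} with masses m(X_i) = P_X(X_i).
PARTITION = [frozenset({1, 2}), frozenset({3, 4}), frozenset({5, 6})]
MASS = {
    PARTITION[0]: Fraction(1, 2),
    PARTITION[1]: Fraction(1, 3),
    PARTITION[2]: Fraction(1, 6),
}


def lower_approximation(partition, subset):
    """L[A]: union of the blocks that lie entirely inside A."""
    result = set()
    for block in partition:
        if block <= subset:
            result |= block
    return result


def upper_approximation(partition, subset):
    """H[A]: union of the blocks that meet A."""
    result = set()
    for block in partition:
        if block & subset:
            result |= block
    return result


def belief(partition, mass, subset):
    """Inner probability of A: total mass of the blocks contained in A."""
    return sum((mass[b] for b in partition if b <= subset), Fraction(0))


A = {1, 2, 3, 5, 6}
```

Here the block {3, 4} is only partly inside A, so it contributes to the upper approximation but not to the belief value.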



Similar results, mainly on concrete counting measures, are obtained by many rough
setters; e.g. see [9] for references and exposition. There are generalizations
([6], [8], [10]).

2. Belief Functions -- Shafer Theory 

2.1. Let $\Omega$ be a finite set and $2^{\Omega}$ be its power set. A unit
interval valued function $m : 2^{\Omega} \to [0, 1]$ is a basic probability
assignment (bpa) if

$m(\emptyset) = 0$ and

$\sum_i m(E_i) = 1$, where $E_i$ varies through $2^{\Omega}$.

A set $E_i$ is a focal element if $m(E_i) \neq 0$.

2.2. A belief function Bel can be defined by basic probabilities:

$Bel(A) = \sum \{m(B) : B \subseteq A\}$, where the summation runs through all
subsets of $A$.

2.3. Let $C = \{1, 2, 3, 4, 5, 6, 7, 8\}$. Two focal elements and their bpa's are

$X = \{1, 2, 3, 4, 5\}$ and $Y = \{2, 3, 4, 5, 6\}$;

$m(X) = 2/3$, $m(Y) = 1/3$, and other bpa's are zero.

Then, by definition,

$Bel(X) = 2/3$, $Bel(Y) = 1/3$, and $Bel(X \cap Y) = 0$.

The belief measure of an atomic intersection is zero; an intersection is called
atomic if it does not contain any focal element. Intuitively, this implies that the
evidences of any two atomic events are always “independent”; an atomic event is an
event that has no sub-events with positive evidence.
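The values in this example follow directly from the definition in 2.2, as the following sketch shows. The function name `bel` and the dictionary representation of the bpa are our own choices.

```python
from fractions import Fraction

X = frozenset({1, 2, 3, 4, 5})
Y = frozenset({2, 3, 4, 5, 6})
BPA = {X: Fraction(2, 3), Y: Fraction(1, 3)}  # all other subsets have mass zero


def bel(bpa, subset):
    """Belief of a set: sum of the masses of the focal elements contained in it."""
    return sum((m for focal, m in bpa.items() if focal <= subset), Fraction(0))
```

The intersection of X and Y contains neither focal element, so its belief is zero even though it covers four of the six points that carry evidence.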

2.4. Let $C = \{1, 2, 3, 4, 5\}$ be the universe. Let $B$ be a binary relation
$B \subseteq C \times C$, which is defined by

$B_1 = \{x \mid (x, 1) \in B\} = \{1, 2, 3\}$

$B_2 = \{x \mid (x, 2) \in B\} = B_3 = \{x \mid (x, 3) \in B\} = \{1, 2, 3, 4\}$

$B_4 = \{x \mid (x, 4) \in B\} = \{2, 3, 4\}$, $B_5 = \{x \mid (x, 5) \in B\} = \emptyset$

Each $B_i$ consists of those points that are B-related to the element $i$; they
are referred to as elementary B-neighborhoods. We will show that "direct"
generalizations are invalid: Let $\Theta(B)$ be the $\sigma$-ring generated by
$\{B_1, B_2, B_3, B_4, B_5\}$. Suppose a measure is defined on $\Theta(B)$ as
follows:

$Pr_B(B_1) = Pr_B(B_4) = 3/10$, $Pr_B(B_2) = Pr_B(B_3) = 2/5$, $Pr_B(B_5) = 0$.

Its inner measure is:

$Pr_{B*}[\{1, 2, 3, 4, 5\}] = SUP\{Pr_B(B_1 \cup B_4 \cup B_5),\ Pr_B(B_2 \cup B_5),\ Pr_B(B_3 \cup B_5)\}$

$= SUP\{Pr_B(B_1 \cup B_4),\ Pr_B(B_2),\ Pr_B(B_3)\} = 2/5$, since $B_5 = \emptyset$.

Next, we take the measure of $B_i$ as bpa's,

$m_B(B_i) = Pr_B(B_i)$, $i = 1, 2, 3, 4$, and other bpa's are zero,

and we have

$Bel_B(\{1, 2, 3, 4, 5\}) = m_B(\{1, 2, 3\}) + m_B(\{2, 3, 4\}) + m_B(\{1, 2, 3, 4\})$

$= Pr_B(B_1) + Pr_B(B_4) + Pr_B(B_1 \cup \{4\}) = 1$

The difference results from the fact that the belief value sums up the multiple
measures of the overlapping areas: an overlap is measured twice in the belief
value, but only once in the inner probability. We believe that bpa's should not be
interpreted as measurements of focal elements; they are measurements of fuzzy sets
whose supports are focal elements. With such interpretations, belief functions are
the inner probabilities of granular fuzzy sets [3], [4], [5]; this is the subject
of the next section.
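The discrepancy in example 2.4 can be reproduced numerically. The sketch below is a simplification under our own reading: since $B_1 \cup B_4$ is the same set as $B_2$, the supremum in the text is taken here as the largest measure of a single distinct neighborhood contained in the universe. All names are hypothetical.

```python
from fractions import Fraction

B1 = frozenset({1, 2, 3})
B2 = frozenset({1, 2, 3, 4})  # B3 is the same set as B2
B4 = frozenset({2, 3, 4})
UNIVERSE = frozenset({1, 2, 3, 4, 5})

# Measure values from the example, keyed by the distinct non-empty neighborhoods.
MEASURE = {B1: Fraction(3, 10), B2: Fraction(2, 5), B4: Fraction(3, 10)}

# Inner value (assumed reading): best single measurable set inside the universe.
inner = max(value for s, value in MEASURE.items() if s <= UNIVERSE)

# Belief value: the same numbers used as bpa's and summed over all focal elements.
belief = sum((value for s, value in MEASURE.items() if s <= UNIVERSE), Fraction(0))
```

The belief value adds the masses of three overlapping sets, while the inner value counts the region they share only once.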

3. Belief Functions - A New View 

3.1. Suppose we are given a bpa, $m : 2^{C} \to [0, 1]$. Let $Y_i$,
$i = 1, 2, \ldots, s$, be the focal elements. Let $\chi_i$ be the characteristic
function of $Y_i$. Then we define our target membership functions (fuzzy sets and
membership functions are synonyms) by

(1) $TY_i(c) = m(Y_i)\chi_i(c)$, $i = 1, 2, \ldots, s$.

We write $n = s + 1$ and define $TY_n(c) = 1 - \sum_{i=1}^{s} TY_i(c)$, or
equivalently,

(2) $\sum_{i=1}^{n} TY_i(c) = 1$.

It is clear that $TY_n(c) \geq 0$; it is a legitimate membership function. Next we
set the bpa of $TY_i$, $i = 1, 2, \ldots, n$:

(3) $m(TY_i) = m(Y_i)$, $i = 1, 2, \ldots, s$, and $m(TY_n) = 0$.

(4) $\sum_{i=1}^{n} m(TY_i) = \sum_{i=1}^{s} m(Y_i) + m(TY_n) = 1$

3.2. Let us recall the notion of fuzzy partitions introduced in [1]. A family of
fuzzy sets $FX_i$, $i = 1, 2, \ldots, t$, is said to be a BH-partition if
$\sum_{i=1}^{t} FX_i(c) = 1$ for every $c$.

Proposition. $\{TY_i \mid i = 1, 2, \ldots, n\}$ is a BH-partition.
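The construction in 3.1 and the BH-partition property can be checked on the bpa of example 2.3. This is a minimal sketch; the function name and the dictionary representation of membership functions are our own assumptions.

```python
from fractions import Fraction

UNIVERSE = range(1, 9)
X = frozenset({1, 2, 3, 4, 5})
Y = frozenset({2, 3, 4, 5, 6})
BPA = {X: Fraction(2, 3), Y: Fraction(1, 3)}


def target_memberships(universe, bpa):
    """Return TY_1..TY_s (mass times characteristic function) followed by the remainder TY_n."""
    memberships = [
        {c: (mass if c in focal else Fraction(0)) for c in universe}
        for focal, mass in bpa.items()
    ]
    # TY_n fills the gap so that the memberships sum to 1 at every point.
    remainder = {c: 1 - sum(t[c] for t in memberships) for c in universe}
    return memberships + [remainder]


MEMBERSHIPS = target_memberships(UNIVERSE, BPA)
```

At point 7, which lies in no focal element, the whole unit of membership goes to the remainder function.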

3.3. Let $U = C \times [0, 1]$ be called the total space [5]. Consider the natural
projection $NP : U \to C$. We will show that there is a crisp partition $\{X_i\}$
on $U$ such that

$TY_i(c) = a([c] \cap X_i) / a([c]) = a([c] \cap X_i)$

where $[c] = c \times [0, 1]$, and $a(c \times S) = a(S)$ is the Lebesgue measure
of $S \subseteq [0, 1]$. Given a membership function $TY_i$ we will define $X_i$
as follows: At each point $c \in C$, we set $c_0 = 0$, $c_n = 1$. For
$i = 1, 2, \ldots, n$, we define half-open intervals

$[c_{i-1}, c_i)$ such that $TY_i(c) = c_i - c_{i-1}$, and set

$X_i = \{c \times [c_{i-1}, c_i) : c \in C\}$

Such an $X_i$ is called a realization of $TY_i$; it is a crisp set representation
of a fuzzy set $TY_i$. The 6-tuple

$(U, NP, C, a, \{TY_i : i = 1, 2, \ldots, n\}, \{X_i : i = 1, 2, \ldots, n\})$

is called the context of $\{TY_i : i = 1, 2, \ldots, n\}$. Since
$\sum_{i=1}^{n} TY_i(c) = 1$,

$I_c = \{[c_{i-1}, c_i) : i = 1, 2, \ldots, n\}$ is a crisp partition on $[0, 1]$, and

$X = \{X_i : i = 1, 2, \ldots, n\}$ is a crisp partition on $U = C \times [0, 1]$.
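The realization step stacks the membership values at each point into consecutive half-open intervals. The sketch below uses hypothetical membership values at two points and a hypothetical function name.

```python
from fractions import Fraction
from itertools import accumulate

# Hypothetical membership values TY_1, TY_2, TY_3 at the points 2 and 7.
MEMBERSHIPS = [
    {2: Fraction(2, 3), 7: Fraction(0)},
    {2: Fraction(1, 3), 7: Fraction(0)},
    {2: Fraction(0), 7: Fraction(1)},
]


def realization(memberships, point):
    """Endpoints (c_{i-1}, c_i) of the half-open intervals above one point of C."""
    values = [t[point] for t in memberships]
    bounds = [Fraction(0)] + list(accumulate(values))
    return list(zip(bounds, bounds[1:]))
```

Each interval has length equal to the corresponding membership value, so a zero membership yields an empty interval.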

So, we have

(S1) a partition $X = \{X_i : i = 1, 2, \ldots, n\}$ on $U$, and

(S2) a numerical value, $m(X_i) = m(TY_i) = m(Y_i)$, $i = 1, 2, \ldots, n$, such that

(S3) $\sum_{i=1}^{n} m(X_i) = 1$

These three conditions clearly say that $X$ can be treated as a random variable on
the total space $U$. Next, we will define a belief function on $C$ by $X$.

$Bel_X(A \times [0, 1]) = P_{X*}(A \times [0, 1]) = SUP\{\sum_j m(X_j) \mid A \times [0, 1] \supseteq \bigcup X_j\}$

Observe that $A \times [0, 1] \supseteq \bigcup X_j$ if and only if
$A \supseteq \bigcup Y_j$, so we have

$P_{X*}(A \times [0, 1]) = SUP\{\sum_j m(X_j) \mid A \times [0, 1] \supseteq \bigcup X_j\}$

$= SUP\{\sum_j m(Y_j) \mid A \supseteq \bigcup Y_j\} = Bel_Y(A)$

So we have defined $Bel_Y(A)$ in terms of the probability $P_X$ of the random
variable $X$. Therefore $Bel_Y$ is a well-defined set function.



5. Conclusions 

In this paper, we show that we can view the bpa of focal elements as probabilities
of fuzzy sets that have those focal elements as their supports. Using the notion of
granular fuzzy sets [3], [4], these fuzzy sets on the base space C can be realized
by crisp sets on the total space U. Thus we have the setting of classical random
variables on the total space. We show that the belief functions of fuzzy sets on C
are the inner probabilities of the random variables on U. So a belief function is a
sound and well-defined set function. These findings also imply that we can
introduce an additive fuzzy measure theory; in fact, it is a generalized function
on membership functions. This will be reported in the near future.

Abstract. This paper extends the notion of information tables and concept
hierarchies of equivalence relations to binary relations. The extended rough set
theory and attribute-oriented generalization techniques can thus be used for
mining binary relations in data.



1 Introduction 

By giving each elementary set (equivalence class) a name, we represent a universe
by an information table. If we apply the same naming procedure to a nested
sequence of equivalence relations, we get a concept hierarchy [2,4]. So traditional
data mining is essentially a mining of equivalence relations.

In [7], Lin showed that by naming each granule of binary relations (elementary
neighborhoods), we represent the universe by, roughly speaking, a “topological”
relational data model, or precisely speaking, an information table with
neighborhood systems [15,14,3]. Similarly, concept hierarchies can be extended
to nested sequences of binary granulations [16]. Thus we have an environment
for computing with binary relations, and hence we can extend the traditional data
mining techniques to mining binary relations [2,5,4,10,18,11,6].

2 Binary Relations and Binary Neighborhood Systems 

We will recall a few relevant terms from the theory of neighborhood systems
[15,9,6]. All results are valid for both crisp and fuzzy worlds. For simplicity,
we will state the results in crisp terms. Let U and V be two crisp sets; V is
called an object space, U a data space.

* Partially supported by EPRI, SJSU, NASA NCC2-275, ONR N00014-96-1-0556,
LLNL 442427-26449, ARO DAAH04-961-0341, and BISC at UC-Berkeley

1. Binary neighborhood system: With each object $p \in V$, we associate a
(possibly empty) subset $B_p \subseteq U$, called the elementary, binary or basic
neighborhood at $p$. The map

$B : V \to 2^U;\ p \to B_p \subseteq U$, or the collection $\{B_p\}$

is referred to as the binary neighborhood system.

2. Binary relations: $B$ is said to be a binary relation on $U$ and $V$ if
$B \subseteq V \times U$. A binary relation defines a binary neighborhood system
by setting

$B_p = \{u \mid (p, u) \in B\}$.

If the binary relation is an equivalence relation ($V = U$), elementary
neighborhoods are pairwise disjoint; so they form a partition. We will regard
an elementary set as the neighborhood of its points. In this paper, we will
use elementary sets in this sense. See example below.

3. A subset $X \subseteq U$ is a definable neighborhood/set if $X$ is a union of
elementary neighborhoods/sets. If a definable neighborhood/set $X$ contains
an elementary neighborhood/elementary set $B_p$, it is a definable neighborhood
of $p$. In the fuzzy world, the union is expressed by means of a t-conorm, for
example, the max operation.

4. For simplicity, a space with a neighborhood system is called an NS-space; it
is a mild generalization of a Fréchet (V) space.

5. Clump system: A neighborhood system is called a clump system if there
are additional information structures imposed on the neighborhood system
[6,7,8]. In this paper, a concept hierarchy is the additional information
structure; see Section 3. In general, information structure is an intuitive, not a
formalized, notion.
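Items 1 to 3 above can be illustrated with a short sketch that derives a binary neighborhood system from a relation and tests definability. The relation, the object set, and both function names are hypothetical.

```python
def neighborhoods(relation, objects):
    """B_p = {u | (p, u) in B} for each object p; may be empty."""
    return {p: frozenset(u for q, u in relation if q == p) for p in objects}


def is_definable(subset, system):
    """True if subset is a union of elementary neighborhoods of the system."""
    covered = set()
    for nb in system.values():
        if nb <= subset:
            covered |= nb
    return covered == set(subset)


# Hypothetical relation on V = U = {1, 2, 3}; it is not an equivalence relation.
RELATION = {(1, 1), (1, 2), (2, 2), (2, 3)}
SYSTEM = neighborhoods(RELATION, {1, 2, 3})
```

Here $B_1 = \{1, 2\}$ and $B_2 = \{2, 3\}$ overlap, and $B_3$ is empty, which an equivalence relation would not allow.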

3 Concept Hierarchies of Binary Relations 

A concept hierarchy is best represented as a tree. A node represents a
concept c. Each parent node represents the concept immediately more general
than (strongly dependent on; see below) c, and each child node represents a
concept immediately more specific than c. The leaf nodes have no children and
represent the base concepts. Traditional concept hierarchies require that all sets
of siblings form elementary sets. In this section, we relax the equivalence
relation to a general binary relation (among sibling concepts). Siblings form
elementary neighborhoods, not necessarily elementary sets [16].

1. Let $B^1$ and $B^2$ be two binary neighborhood systems. We say $B^2$ is
strongly dependent on $B^1$, denoted by

$B^1 \Rightarrow B^2$,

iff every $B^2$-neighborhood is a definable $B^1$-neighborhood.

2. If $B^1 \Rightarrow B^2$, we will say $B^1$ is definably finer than $B^2$, or
$B^2$ is definably coarser than $B^1$.

A concept hierarchy is a nested sequence of neighborhood systems, in which
each elementary neighborhood is given a name, called an elementary concept or
simply concept:

1. $B^0_j$ is an elementary neighborhood $\subseteq U$;
$C^0_j = NAME(B^0_j)$ is a 0th level elementary concept.

2. $B^1_{j_1} = \bigcup\{B^0_j \mid B^0_j \text{ is in a neighborhood subsystem}\}$
is a 1st level definable neighborhood;
$C^1_{j_1} = NAME(B^1_{j_1})$ is a 1st level elementary concept.

3. $B^{k+1}_{j_{k+1}} = \bigcup\{B^k_{j_k} \mid B^k_{j_k} \text{ is in a neighborhood subsystem}\}$
is a (k+1)th level definable neighborhood;
$C^{k+1}_{j_{k+1}} = NAME(B^{k+1}_{j_{k+1}})$ is a (k+1)th level elementary concept.
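The strong-dependence relation that links consecutive levels of such a hierarchy can be sketched as a definability check. The two neighborhood systems below are hypothetical, as are the function names.

```python
def definable(subset, system):
    """True if subset is a union of neighborhoods taken from system."""
    covered = set()
    for nb in system:
        if nb <= subset:
            covered |= nb
    return covered == set(subset)


def strongly_depends(coarse, fine):
    """True if every neighborhood of `coarse` is a definable neighborhood of `fine`."""
    return all(definable(nb, fine) for nb in coarse)


# Hypothetical two-level example on U = {1, 2, 3, 4}.
LEVEL_0 = [frozenset({1}), frozenset({2}), frozenset({3, 4})]
LEVEL_1 = [frozenset({1, 2}), frozenset({2, 3, 4})]
```

Note that the level-1 neighborhoods overlap at the point 2, which would not be possible in a traditional hierarchy built from equivalence relations.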

4 Granular Structures and Representations 

An equivalence relation decomposes the universe into pairwise disjoint elementary
sets. A binary relation decomposes the universe into elementary neighborhoods
that are not necessarily disjoint. The decomposition is called a binary
granulation or a binary neighborhood system [6,7,8].

1. A binary granular structure consists of a 4-tuple

$(V, U, B, C)$

where (1) $V$ is the object space, (2) $U$ is the data space ($V$ and $U$ could
be the same set), (3) $B$ is the binary neighborhood system or binary clump
system (see Section 2, Item 5), and (4) $C$ is the concept space, which consists
of names of elementary neighborhoods.

2. A binary granular structure is called a rough structure, denoted by ( E^ C), 
if R = (7, is a partition. 

3. A binary knowledge base is a finite collection of binary granular structures; 
it will be denoted by (V, [/, {5^, CiA = 1,2, ...,n.}) or simply {BiA = 
1, 2, . . . ,n.}. We will assume all granular structures have additional struc- 
tures, namely, concept hierarchies. If V = U and 5 is a partition (with 
no-additional structure) Pawlak call it knowledge base [20]. 

4. A knowledge is referred to a granular structure with one single binary neigh- 
borhood system or clump system, for example (V, [/, Bi^^CiA) or simply Bi^. 



4.1 Extended Tables of Binary Granulations 

Zdzislaw Pawlak showed that a knowledge base can be represented by an
information table and vice versa [20]. T. Y. Lin extended this notion to binary
relations [7]; in this paper, we consider a mild generalization, namely, to
binary granular structures.

Let $(V, U, \{B^i, C^i, i = 1, 2, \ldots, n\})$ be a given collection of binary
granular structures. For simplicity, we may suppress the index $i$, namely,
$B^i = B$, $C^i = C$, $LEARN^i = LEARN$. At level $i$, we consider the mapping
LEARN [17], that is,

$LEARN : V \rightarrow C$

that maps an object $p$ to its unique elementary neighborhood $B_p \subseteq U$,
then to its name $C_p = NAME(B_p) \in C$ (called an elementary concept). In
notation,

$p \rightarrow B_p \rightarrow NAME(B_p)$.

Intuitively, the mapping LEARN generalizes a datum to an i-th level concept. A
collection of such LEARN mappings forms an extended information table [7]. We
should like to point out that, if the binary granular structure is a rough
structure, then the extended table reduces to Pawlak's information table.
However, we should caution readers that, unlike in classical rough set theory,
the entries in the extended table are not semantically independent; there are
inter-relationships among data; see Table 2. Rough set theory is a theory of
"name (word) processing" on discrete words (a crisp set), while ours is on
clustered words (an NS-space).

It is interesting to observe that we reached the same conclusion from a pure
database point of view about a decade ago; see earlier works in [15,13] and
later references and exposition [3,19,6].

Examples 

To avoid unnecessarily complex notation, we will skip the concept hierarchies in
the example. Let the object and data space be the same, $U = V = \{1, 2, 3, 4\}$.

1. $B$ is a binary neighborhood system defined by

$B_1 = \{2, 3, 4\}$; $B_2 = \{1, 2\}$; $B_3 = B_4 = \{3, 4\}$.

Equivalently, it can be expressed as a binary relation:

$B = \{(1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (3, 3), (3, 4), (4, 3), (4, 4)\}$.

Elementary concepts of $B$ are:

$NAME(B_1)$ = all; $NAME(B_2)$ = middle; $NAME(B_3) = NAME(B_4)$ = large.

2. $E$ is an equivalence relation defined by

$E_1 = E_2 = \{1, 2\}$; $E_3 = E_4 = \{3, 4\}$.

These neighborhoods (elementary sets) form a partition: $\{1, 2\}$; $\{3, 4\}$.

Elementary concepts of $E$ are:

$NAME(E_1) = NAME(E_2)$ = low; $NAME(E_3) = NAME(E_4)$ = high.
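
The example can be traced with a short sketch. It derives the elementary neighborhoods of B from the binary relation, applies the LEARN mapping (object, then neighborhood, then name), and lists the pairs of names that the relation induces. The function and variable names (`neighborhoods`, `learn`, `related_names`) are illustrative choices, not notation from the paper.

```python
# Binary relation B from the example, as a set of ordered pairs (p, q).
B = {(1, 2), (1, 3), (1, 4), (2, 1), (2, 2), (3, 3), (3, 4), (4, 3), (4, 4)}
UNIVERSE = [1, 2, 3, 4]


def neighborhoods(relation, universe):
    """Return {p: B_p}, where B_p = {q : (p, q) is in the relation}."""
    return {p: frozenset(q for (x, q) in relation if x == p) for p in universe}


# Names of the elementary neighborhoods, as given in the example.
NAME = {
    frozenset({2, 3, 4}): "all",
    frozenset({1, 2}): "middle",
    frozenset({3, 4}): "large",
}


def learn(p, nbhd, name):
    """LEARN: object p -> elementary neighborhood B_p -> NAME(B_p)."""
    return name[nbhd[p]]


nbhd = neighborhoods(B, UNIVERSE)
# One column of the extended information table.
column = {p: learn(p, nbhd, NAME) for p in UNIVERSE}
# Pairs of names related through B: the entries are not independent.
related_names = {(column[p], column[q]) for (p, q) in B}
```

Under these assumptions `column` reproduces the B-attribute column of the extended table, and `related_names` contains the five name pairs induced by the relation.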
Objects   (V, U, B, C)-attribute   (U, U, E, C)-attribute
ID_1      all                      low
ID_2      middle                   low
ID_3      large                    high
ID_4      large                    high

Table 1. Binary granulations from the Example; entries are semantically
interrelated; see Table 2



(V, U, B, C_B)-attribute   (V, U, B, C_B)-attribute
all                        large
all                        middle
middle                     all
middle                     middle
large                      large

Table 2. Exemplary semantic relations between entries in Table 1



4.2 Knowledge Bases 

Pawlak introduced the notions of cores and reducts for knowledge bases. These
notions can be extended to binary granular structures; the details will appear
elsewhere. We summarize the notions in Table 3 [7].

5 Mining Binary Relations in Data and Conclusions 

I. Mining Environment: 

1. Domain: Extended information tables. Unlike in rough set theory, the entries
in the extended tables are semantically related; see Table 2.

2. Background Knowledge: For each attribute, there is a nested sequence of
binary relations, nested by strong dependencies.

3. Level of Target Knowledge: By forming extended information tables on
different levels, we obtain rules of different levels. For example, let us form
an extended table with k-th level concepts and (k+j)-th level concepts. If the
knowledge level is at the 0th level, we call them hard rules. If k > 1 and
j = 0, we get high level rules of the same level. If j > 1, we get different
levels of soft rules [18,6].

II. Mining Methodology: 

1. Rough Set Approach: By applying the table processing techniques of rough set
theory, we get rules from different levels [18,16].

2. Other approaches: For example, association rules or their combination with
rough set theory [1,10,11,12].
knowledge oriented        rough set theory      single level           multilevel
terms                                           granulation            granulation

knowledge (geometric)     partition             binary granular        granular structure
                          (classification)      structure
knowledge (algebraic)     equivalence           binary relations
                          relations
granule                   elementary set        elementary             fundamental
                          (equivalence class)   neighborhood           neighborhood
concept space             elementary            elementary             fundamental
                          concept space         concept space          concept space
knowledge base            Pawlak                binary                 formal word
                          knowledge base        knowledge base         knowledge base
knowledge                 information table     extended               formal word
representation                                  information table      table [8]

Table 3. Knowledge Bases of Granulations



6 Conclusion 

The table processing methodology of rough set theory and the attribute oriented
generalization (AOG) of data mining are powerful techniques and notions; here
they are extended to binary relations. The techniques of traditional data mining
of discrete data (a crisp set) can be extended to clustered data (an NS-space).
We are exploring further, in fact extending them to general neighborhood
systems; we have had some success in the fuzzy world [8]. Applications are on
the way; we will report them in the near future.
Abstract. We consider the discovery of strong decision rules from
information systems defined by continuous attributes. The rule
discovery algorithm Explore is extended by introducing a new
method that handles continuous attributes while constructing
the elementary conditions in a way that corresponds to the
requirements for getting strong decision rules. The usefulness of
this method is evaluated in an experiment.



1 Introduction 

One of the purposes of rule discovery systems is to extract from data sets
information patterns and regularities which are interesting and useful to
different kinds of users. Discovery of such decision rules is a difficult
problem that depends on the context of application and requires close
interaction with the user, who has to define requirements for the derived
knowledge and should be able to direct the knowledge discovery process and
validate its results (cf. [5]). Such postulates led the author to introduce the
algorithm Explore [7], which induces all decision rules that satisfy
user-defined requirements. This algorithm takes into account requirements
related to various criteria of rule evaluation, e.g. the strength of the rules
(representing the relative number of learning examples supporting the rule), the
length of the condition part of rules, and the level of discrimination. In
current experiments with the algorithm we focused mainly on the strength of
decision rules (for more details see [7]). The motivation for discovering all
strong rules according to the user's requirements is also typical for data
mining algorithms (cf. the Apriori algorithm [1] for mining association rules in
large databases, which are in fact a different kind of rules than those
considered in our approach).

The first version of the algorithm Explore was limited to analyzing data sets
defined using qualitative attributes only. Continuous attributes could be
handled by discretization techniques which convert them into discrete
attributes. Several discretization methods have already been proposed which are
applied as a preprocessing step before the phase of rule induction (see, e.g.,
reviews in [2,3]). The main difficulty with these methods is that they determine
the discretization independently of the rule induction algorithm applied
afterwards. It is possible that the set of discretized subintervals being
candidates for elementary conditions may not satisfy the requirements for
getting strong rules. For instance, in some specific data sets the
discretization may produce a very limited number of subintervals, while in other
cases it could give too many weak conditions. As a result the algorithm may be
unable to discover a sufficient number of strong decision rules. Therefore, it
is necessary to look for another discretization technique that is more deeply
integrated with the algorithm Explore and its requirements.

The aim of this paper is to extend the algorithm Explore by introducing a phase
of direct discretization of continuous attributes while creating elementary
conditions. The method should try to generate conditions that correspond to the
requirements for extracting the strongest decision rules. The proposed method
will be compared experimentally with the method of preliminary discretization to
evaluate which of them allows better rules to be induced.

The paper is organized as follows. The next section briefly describes the
algorithm Explore. Then, the method of handling continuous attributes is
introduced. A summary of results (restricted due to the paper size) is given in
section 4. A discussion of the results and conclusions are given in the final
section.

2 Algorithm Explore as a tool for inducing strong 
decision rules 

We give only basic information about the algorithm Explore (for more details
see [7]). The decision rules are iteratively induced for each decision class. If
the input data contain inconsistencies, they can be handled by means of rough
set theory, i.e. exact (certain) and approximate (possible) decision rules are
induced from lower and upper approximations of decision classes [6,9]. Another
way of handling inconsistencies is to induce partly discriminating decision
rules for each decision class, i.e. rules which, besides learning examples from
a given class, could cover a limited number of examples from other classes
(similar motivations are used in the variable precision rough sets model [10]).
The second way will be used in our experiment. Below we introduce some formal
definitions.

Let $(U, A \cup \{d\})$ denote the information system, where $U$ is a finite set
of objects, $A$ is a finite set of condition attributes, and $d \notin A$ is a
distinguished decision attribute that defines a partition of objects into a set
of decision classes $Y_1, Y_2, \ldots, Y_k$. Let $K$ represent the decision
concept to be described by rules (i.e. a decision class $Y_j$ or its
approximation, depending on the way of handling inconsistencies). An elementary
condition $c$ is defined as an expression $(a \; rel \; v)$, where $a \in A$,
$v$ is its value (or a set of values), and $rel$ stands for a relational
operator, e.g. $=, <, >, \in$. Let $C$ be a conjunction of $q$ elementary
conditions, i.e. $C = c_1 \wedge c_2 \wedge \cdots \wedge c_q$. Then $[C]$ is
the subset of examples which satisfy the conditions represented by $C$.
Considering the concept $K$ to be described, this subset is divided into the
positive cover $[C]^+_K = [C] \cap K$ and the negative cover
$[C]^-_K = [C] \cap (U \setminus K)$.

A decision rule $r$ is an assertion of the form "if $R$ then $K$", where $R$ is
a minimal conjunction $c_1 \wedge c_2 \wedge \cdots \wedge c_q$ satisfying
$[R]^+_K \neq \emptyset$. A rule $r$ is discriminant (exact) if
$[R]^-_K = \emptyset$; otherwise it is partly discriminant. The rules are
characterized by two measures. The strength of the rule $r$ is defined as
$Strength(r) = |[R]^+_K| / |K|$.
The level of discrimination $D(R)$ is defined as $|[R]^+_K| / |[R]|$.
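
As a small illustration of the two measures, the sketch below computes the strength and the level of discrimination of a conjunction on a toy data set. Examples are assumed to be dictionaries and conditions plain predicates; these representations and the function names are illustrative and do not come from the paper.

```python
def cover(examples, conditions):
    """[C]: indices of the examples satisfying every condition of the conjunction."""
    return {i for i, ex in enumerate(examples) if all(c(ex) for c in conditions)}


def strength_and_discrimination(examples, labels, conditions, target):
    """Return (Strength, D) of the conjunction with respect to the class `target`."""
    covered = cover(examples, conditions)
    concept = {i for i, y in enumerate(labels) if y == target}  # the concept K
    positive = covered & concept                                # positive cover
    strength = len(positive) / len(concept) if concept else 0.0
    discrimination = len(positive) / len(covered) if covered else 0.0
    return strength, discrimination


examples = [{"a": 1.0}, {"a": 2.0}, {"a": 3.0}, {"a": 4.0}]
labels = ["yes", "yes", "no", "yes"]
# Single elementary condition (a < 3.5) describing the class "yes".
s, d = strength_and_discrimination(
    examples, labels, [lambda ex: ex["a"] < 3.5], "yes"
)
```

Here the condition covers three examples, two of them from the class "yes" (which has three members), so both measures equal 2/3.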
In the algorithm Explore the search for rules is controlled by parameters
called stopping conditions SC that reflect the user's requirements. As the main
attention is paid to the strength of the rules, the definition of SC is
connected with determining the threshold value for the minimal strength of a
conjunction that is a candidate for the condition part of a rule. If its
strength is lower than SC, it is discarded; otherwise it can be further
evaluated. Additionally, one can define a threshold d expressing the minimum
value of the level of discrimination D(R) of the rules to be generated. The
algorithm Explore is based on a breadth-first strategy which generates rules of
increasing size, starting from the shortest ones. The main part of the algorithm
is presented in pseudo-code below:

Procedure Explore(SC: stopping_conditions; d: discrimination_threshold;
                  L: list_of_conditions; var ℛ: set_of_rules)
begin
  ℛ := ∅;
  for each condition c from list L do
  begin
    if [c]+ = ∅ or c satisfies SC then discard c;
    if D(c) ≥ d then ℛ := ℛ ∪ {c} and discard c
  end;
  form a queue Q with all the remaining elementary conditions c_1, ..., c_n
    (ordered according to the decreasing strength);
  while Q ≠ ∅ do
  begin
    remove the first conjunction C from the queue Q;
    generate the set 𝒞 of all the conjunctions C ∧ c_{h+1}, C ∧ c_{h+2}, ..., C ∧ c_n
      where h is the highest index of the condition involved in C;
    for each C' ∈ 𝒞 do
    begin
      if [C']+ = ∅ or C' satisfies SC then 𝒞 := 𝒞 \ {C'};
      if D(C') ≥ d then
      begin
        if C' is minimal then ℛ := ℛ ∪ {C'};
        𝒞 := 𝒞 \ {C'}
      end;
    end;
    place all the conjunctions from 𝒞 at the end of the queue
  end
end
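
The breadth-first search in the pseudo-code can be sketched in Python as follows. This is only a minimal reading of the procedure under stated assumptions: "satisfies SC" is taken to mean that the strength falls below a minimal strength threshold, both thresholds are treated as inclusive, examples are dictionaries, and conditions are (name, predicate) pairs. The names `explore`, `min_strength` and `min_discr` are hypothetical.

```python
from collections import deque


def explore(examples, labels, target, conditions, min_strength, min_discr):
    """Return all minimal conjunctions (as lists of condition names) that are
    strong enough and discriminating enough for the class `target`."""
    concept = {i for i, y in enumerate(labels) if y == target}

    def measures(conj):
        covered = {i for i, ex in enumerate(examples)
                   if all(conditions[j][1](ex) for j in conj)}
        pos = covered & concept
        strength = len(pos) / len(concept) if concept else 0.0
        discr = len(pos) / len(covered) if covered else 0.0
        return strength, discr

    rules = []  # accepted conjunctions, as tuples of condition indices

    def is_minimal(conj):
        # A conjunction is minimal if no accepted rule is contained in it.
        return not any(set(r) <= set(conj) for r in rules)

    # Single conditions: discard weak ones, accept discriminating ones.
    kept = []
    for j in range(len(conditions)):
        s, dsc = measures((j,))
        if s == 0 or s < min_strength:
            continue
        if dsc >= min_discr:
            rules.append((j,))
        else:
            kept.append((s, j))
    # Remaining conditions ordered by decreasing strength.
    order = [j for _, j in sorted(kept, key=lambda t: -t[0])]
    queue = deque((p,) for p in range(len(order)))

    while queue:
        conj = queue.popleft()
        # Extend only with conditions of higher position in the ordering.
        for nxt in range(conj[-1] + 1, len(order)):
            cand = conj + (nxt,)
            real = tuple(order[p] for p in cand)
            s, dsc = measures(real)
            if s == 0 or s < min_strength:
                continue            # stopping condition: too weak
            if dsc >= min_discr:
                if is_minimal(real):
                    rules.append(real)
                continue            # accepted or redundant: not extended
            queue.append(cand)      # still a candidate for longer rules
    return [[conditions[j][0] for j in r] for r in rules]


examples = [{"a": 1, "b": 1}, {"a": 1, "b": 2}, {"a": 2, "b": 1},
            {"a": 2, "b": 2}, {"a": 1, "b": 1}]
labels = ["pos", "neg", "neg", "neg", "pos"]
conditions = [("a = 1", lambda ex: ex["a"] == 1),
              ("b = 1", lambda ex: ex["b"] == 1),
              ("a = 2", lambda ex: ex["a"] == 2)]
found = explore(examples, labels, "pos", conditions, 0.5, 1.0)
```

On this toy data neither single condition is exact, so the search combines the two strong conditions into one exact rule.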

The crucial issue is the creation of the list L of allowed elementary
conditions. In the previous experiments [7] only nominal or preliminarily
discretized attributes were considered. For a given decision concept $K$, the
elementary conditions were created as attribute-value pairs $(a = v)$ such that
$[(a = v)]^+_K \neq \emptyset$. As the list of values of the considered discrete
attributes is rather small, these conditions are quite fast to detect by
scanning the respective attribute values for objects in the input information
system.

Let us stress that the Explore algorithm guarantees getting the set of all
decision rules that satisfy the user's requirements. As this perspective is
quite different from that of rule learning methods which induce a minimum set of
classification rules (e.g. C4.5 or LEM2), the algorithm Explore is more
demanding from the time and memory complexity point of view. Its computational
complexity is exponential in the worst case. However, in practice users are
interested in discovering strong rules of small size. Because of the way the
search is performed (see the concept of the ordered queue), such rules are
induced first. Therefore, if the user sets proper stopping conditions, the
computational costs are efficiently reduced.

3 Discretizing continuous attributes while creating 
elementary conditions 

We propose to discretize continuous attributes directly at the moment when
elementary conditions are created. As decision rules are mainly evaluated by
their strength and secondly by their level of discrimination, the conditions
searched for should cover the largest number of examples from the given decision
class and at the same time cover the smallest number of examples from other
classes. If one starts from strong enough and partly discriminating elementary
conditions, the algorithm Explore, performing its breadth-first search strategy,
could combine them into sufficiently strong rules. Because of the strength
requirements, the method discussed below will be called Maximal Strength
Partitioning, denoted MSP for short.

The created elementary conditions c will be evaluated by two measures:
their strength Strength(c) and level of discrimination D(c). Similarly to the
Explore requirements, these measures are controlled by two threshold
parameters: Min_Strength and Min_Discr.

As Explore generates rules iteratively for each decision class, the MSP
method also generates elementary conditions independently for each class. The
generation of selectors is done locally for each attribute. The conditions are
represented in the form either (a < x) or (a > x), where a is an attribute and
x is a cut point. The candidates for the cut point are scanned locally over the
range of each continuous attribute. This means that the values of the attribute
for objects in the input data are sorted in increasing order. With each value we
also store information about the decision class assignments of the objects
having this value. Candidate cut points are computed as mid-points between
successive values in the sorted sequence. We consider only mid-points between
values that are characterized by a change of class assignment. If for a
candidate cut point x either of the conditions (a < x) or (a > x) satisfies both
criteria Strength(c) ≥ Min_Strength and D(c) ≥ Min_Discr, it is temporarily
added to L, the list of allowed conditions for the algorithm Explore. The time
complexity of this technique is O(nm), where n is the number of examples and m
is the number of candidate cut points, which in the worst case is bounded by the
number of values of a given attribute.
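
The scan described above can be sketched for a single attribute and a single decision class. The sketch assumes inclusive thresholds and treats a pair of successive values as a class change whenever the two values do not share one single class; the function names `candidate_cut_points` and `msp_conditions` are hypothetical.

```python
def candidate_cut_points(values, labels):
    """Mid-points between successive distinct values where the class changes."""
    by_value = {}
    for v, y in zip(values, labels):
        by_value.setdefault(v, set()).add(y)
    distinct = sorted(by_value)
    cuts = []
    for lo, hi in zip(distinct, distinct[1:]):
        if len(by_value[lo] | by_value[hi]) > 1:  # class assignment changes
            cuts.append((lo + hi) / 2)
    return cuts


def msp_conditions(values, labels, target, min_strength, min_discr):
    """Return accepted conditions as (operator, cut point, strength, discr)."""
    n_target = sum(1 for y in labels if y == target)
    accepted = []
    for x in candidate_cut_points(values, labels):
        for op in ("<", ">"):
            # Classes of the examples covered by (a < x) or (a > x).
            covered = [y for v, y in zip(values, labels)
                       if (v < x if op == "<" else v > x)]
            pos = sum(1 for y in covered if y == target)
            strength = pos / n_target if n_target else 0.0
            discr = pos / len(covered) if covered else 0.0
            if strength >= min_strength and discr >= min_discr:
                accepted.append((op, x, strength, discr))
    return accepted


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["A", "A", "A", "B", "B", "B"]
cuts = candidate_cut_points(values, labels)
conds = msp_conditions(values, labels, "A", 0.1, 0.3)
```

Each candidate cut point requires one pass over the n examples, which matches the O(nm) bound stated in the text.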

If the list L at the end of discretizing the attribute contains too many
potential conditions, it is possible to restrict their number. This requires
defining an input parameter called Maximum_conditions. For its given value the
best conditions from the list L are selected. Half of them are chosen according
to the criterion of the highest value of the discrimination level, while the
rest are chosen according to the highest value of the strength. This
differentiation results from the possible trade-off between both criteria (i.e.
the conditions having the highest discrimination level may not be the strongest
ones).
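
The selection step can be sketched as below. The text does not say how overlaps between the two halves are handled, so this sketch assumes that the strength-based half is drawn from the conditions not already chosen by discrimination level; `select_conditions` and the tuple layout (label, strength, discrimination) are illustrative.

```python
def select_conditions(conditions, max_conditions):
    """Keep at most max_conditions: half by discrimination, the rest by strength.

    conditions: list of (label, strength, discrimination) tuples.
    """
    if len(conditions) <= max_conditions:
        return list(conditions)
    half = max_conditions // 2
    # First half: highest level of discrimination.
    by_discr = sorted(conditions, key=lambda c: -c[2])[:half]
    # Second half: strongest among the conditions not yet selected (assumption).
    remaining = [c for c in conditions if c not in by_discr]
    by_strength = sorted(remaining, key=lambda c: -c[1])[:max_conditions - half]
    return by_discr + by_strength


candidates = [("c1", 0.9, 0.4), ("c2", 0.5, 0.9), ("c3", 0.8, 0.5), ("c4", 0.3, 0.8)]
selected = select_conditions(candidates, 2)
```

In this toy case the most discriminating condition (`c2`) and the strongest remaining one (`c1`) are kept, which illustrates the trade-off between the two criteria.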
If the list L for the given requirements contains fewer elementary conditions
than the required Maximum_conditions number, it is possible to extend their
number. We propose to additionally use the minimal entropy partitioning method
(ME) [4], which preliminarily divides the range of values of the attribute into
two 'purer' subintervals. Then, cut points inside these subintervals are scanned
using our MSP approach. Because of the preliminary sub-division, the tested
conditions are now represented in the form (a ∈ [x, b]) or (a ∈ [b, x]), where b
is the boundary generated by ME and x is a tested cut point. If it is possible
to identify conditions that satisfy the tests for Min_Strength and Min_Discr,
they are added to L, the list of allowed conditions.
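
The preliminary division into two 'purer' subintervals can be illustrated with a single entropy-minimizing binary split. This is a sketch of the general idea only (choose the boundary that minimizes the size-weighted class entropy of the two parts), not necessarily the exact procedure of [4]; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy (in bits) of a non-empty list of labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def min_entropy_boundary(values, labels):
    """Return the mid-point boundary b minimizing the weighted entropy of
    the two subintervals, or None if all values are equal."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        b = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or score < best[0]:
            best = (score, b)
    return None if best is None else best[1]


boundary = min_entropy_boundary([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                                ["A", "A", "A", "B", "B", "B"])
```

The boundary b found this way would then delimit the intervals [x, b] and [b, x] scanned by MSP.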

4 Experiment 

In our experiment, we compared the proposed discretization method MSP to
an approach where the Minimal Entropy Partitioning method [4], denoted as ME,
is used in the preprocessing phase. The ME method is one of the most often
applied approaches to discretizing attributes in the preprocessing phase before
rule induction. Moreover, it was chosen because of some similarity to the
scanning of the histogram of attribute values proposed in the MSP method.

The aim of the comparison was to evaluate which of the methods allows
the algorithm Explore to produce the better set of decision rules. The average
strength of the rules was the main criterion (the higher the better). We also
analyzed the number of rules, and as a supporting criterion we used the
classification accuracy estimated in a 10-fold cross validation test. In the
experiment we used three data sets from the U.C. Irvine repository [8], i.e.
bank, bupa and iris. The fourth data set, buses, comes from our previous
experiments with technical diagnostics. Each of the data sets was defined using
continuous attributes only. Let us briefly characterize them: bank concerns two
decision classes (cardinality of objects in classes 33/33) described by 5
continuous attributes; bupa has two classes (201/85 objects) described by 6
attributes; buses has two classes (46/30 objects) described by 8 attributes; and
iris has three classes (50/50/50) described by 4 attributes.



Table 1. The characteristics of sets of elementary conditions obtained by MSP
and ME discretization methods for the Bupa data set; */* denotes results in each
decision class, '-' means that it was not possible to obtain any elementary
conditions.

                MSP discretization                      ME discretization
Min_Discr [%]   no_cond  av_strength [%]  av_discr [%]  no_cond  av_strength [%]  av_discr [%]
30              35/22    81/73            45/63         7/8      78/66            48/59
40              35/22    81/73            45/63         6/8      80/66            47/59
50              11/20    50/70            52/63         2/8      54/66            56/59
60              5/18     21/59            66/65         -/3      -/59             -/69


For the proposed MSP discretization method we decided to test several values
of control parameters. We assumed that the maximal number of elementary
conditions for one attribute will be restricted to 10. We tested systematically
the following values of Min_Strength for elementary conditions: 10%, 20%, 30%,
40%, 50%. For each of them we checked the following values of Min_Discr: 30%,
40%, 50%, 60% and 70%.

Table 2. The sets of elementary conditions created for all data sets
(Min_Strength = 10% and Min_Discr = 30%).

        MSP discretization                  ME discretization
Data    no_cond   av_strength  av_discr     no_cond  av_strength  av_discr
Bank    22/19     88/90        80/76        5/5      89/78        76/80
Bupa    35/22     81/73        45/63        7/8      78/66        48/59
Buses   14/20     91/89        88/80        9/8      89/86        88/90
Iris    11/20/16  81/88/86     77/49/65     4/4/4    88/84/97     91/46/48

For all data sets we noticed that various values of Min_Strength did not
influence the quality of the created elementary conditions very much, assuming
only that the strength is defined as at least 10%. Moreover, the values of the
minimal level of discrimination Min_Discr did not influence the created
elementary conditions for the Bank and Buses data, while for the two other data
sets, Bupa and Iris, increasing the value over 50% led to difficulties in
getting a sufficient number of strong enough elementary conditions. This
tendency is illustrated by the results for the Bupa data presented in Table 1.
In Tables 1 - 2 we use the following abbreviations: no_cond - number of
elementary conditions, av_strength - average strength of elementary conditions
[%], av_discr - average level of discrimination of elementary conditions [%].
The summarized results evaluating the quality of elementary conditions for all
data sets are given in Table 2; they refer to the parameters Min_Strength = 10%
and Min_Discr = 30%.



Table 3. The characteristics of decision rules; */* denotes results in each
decision class, '-' means that it was not possible to obtain any decision rule.

          MSP discretization                                      ME discretization
Data set  no_rules  av_strength [%]  av_discr [%]  accuracy [%]   no_rules  av_strength [%]  av_discr [%]  accuracy [%]
Bank      45/34     87/84            94/95         91.67          2/3       84/74            95/96         92.33
Bupa      3/11      16/17            85/91         62.85          2/-       28/-             85/-          27
Buses     11/16     94/88            96/96         98.57          8/6       92/85            97/95         97.56
Iris      2/11/6    72/71/78         99/94/92      94.14          3/-/-     85/-/-           99/-/-        32.34


These sets of elementary conditions were the basis for using the algorithm
Explore. We tested four different stopping conditions SC: 15%, 20%, 25%, 30%.
Only the first threshold led to satisfactory results; in particular, for higher
values we met difficulties with identifying strong decision rules from
conditions produced by ME discretization. We also noticed that it is interesting
to look for partly discriminant rules (with the required level of discrimination
of 85% - 95%), as strictly discriminant rules were weaker and more numerous.
Table 3 summarizes information characterizing the discovered sets of rules; the
rules were induced for minimal strength SC = 15% and d = 85%. The abbreviations
used in Table 3 refer to average strength and discrimination levels of rules;
accuracy means classification accuracy estimated in a 10-fold cross validation
test. Let us notice that for ME discretization it was not possible to induce
decision rules for some classes in the case of the Bupa and Iris data sets.

Table 4. Comparison of decision rules induced by Explore and LEM2 algorithms.

          Explore                              LEM2
Data set  no_rules  av_strength  accuracy      no_rules  av_strength  accuracy
Buses     27        91           98.57         3         93           98.7
Bank      79        85.5         91.67         4         83.1         93.9
Bupa      14        16.5         62.85         2ME: 9    3.05         52.4
                                               kME: 30   2.3          62.85
Iris      37        77           94.14         2ME: 2    -            33.33
                                               kME: 14   22           94.33

Then, we compared the satisfactory set of rules induced by the algorithm
Explore with rules obtained by means of the classification-oriented rule
induction system LEM2 [6], which was used on elementary conditions prepared by
the ME preliminary discretization phase. As this discretization was quite weak
for the Bupa and Iris data sets, we additionally used its version with larger
numbers of discrete subintervals (denoted by kME). Comparative results are given
in Table 4.

5 Discussion of results and final remarks 

Let us summarize the results of the experiment. The introduced MSP
discretization method produces a greater number of elementary conditions,
characterized by higher average strength, than the typical preprocessing
discretization method based on minimal entropy partitioning (ME). It could be
considered a better basis for the further induction of decision rules. This was
particularly observed for difficult data sets like Bupa.

The MSP method has to be parametrized by two parameters: the minimal strength
and the minimal discrimination level of the elementary condition. We noticed for
the analyzed data sets that good MSP results are quite stable when these
parameters are changed. On the other hand, increasing these values for filtering
ME discretization deteriorates the quality of elementary conditions for two data
sets, Iris and Bupa. It also seems that the minimal level of discrimination has
greater influence on the final results, assuming that the minimal strength is
not lower than 10%. Although both parameters could be tuned by the user
depending on the context of application, the experiments indicate default values
for them: strength - 10% and discrimination level - 30%.

The MSP discretization helped Explore in discovering more decision rules
characterized by higher average strength. In particular, its advantage could be
noticed for the more difficult data sets (Bupa and Iris), where ME
pre-discretization failed to discover strong decision rules.

Finally, the comparison of rules derived by the Explore and LEM2 algorithms
showed that by using the former algorithm it was always possible to discover a
set of better rules according to the criteria of number of rules and average
rule strength, without decreasing the classification accuracy too much.
Abstract. In a rough set approach to knowledge discovery problems,
a set of rules is generated from training data using the notion of a
reduct. Because the problem of finding short reducts is NP-hard, we have
to use approximation techniques. A covering approach to the
problem of generating rules from an information system is presented
in this article. A new, efficient algorithm for finding local reducts for
each object in a data table is described, together with its parallelization
and some optimization notes. The problem of working with tolerances in our
algorithm is discussed. Some experimental results obtained on large
data tables (from real applications) are presented.



1 Introduction 

Rough set expert systems are based on the notion of a reduct ([7], [8]), a
minimal subset of attributes which is sufficient to discern between objects with
different decision values. A set of short reducts can be used to generate rules
([1]). The problem of generating short reducts is NP-hard, but an approximate
algorithm (like the genetic one described in [9], [4] and implemented
successfully, see [6]) can be used to obtain reducts in reasonable time. On the
other hand, rules generated from reducts are often too specific and cannot
classify new objects. Other types of reducts have been considered to improve
efficiency on new objects (see [2]). One of the methods is to calculate reducts
based on a single object.

Let $\mathbb{A} = (U, A \cup \{d\})$ be an information system (see [8]), where
$U$ is a set of objects, $A$ is a set of attributes, and $d$ is a decision.

Definition: A local reduct $R(o_i) \subseteq A$ (or a reduct relative to the
decision and the object $o_i$; $o_i$ is called a base object) is a subset such
that:

a) $\forall o_j \in U: \; d(o_i) \neq d(o_j) \Rightarrow \exists a_k \in R: \; a_k(o_i) \neq a_k(o_j)$

b) $R$ is minimal with respect to inclusion.

A classical reduct will be referred to as a global reduct. A rule generated by a
local reduct is concerned with the base object and may not recognize any other
object from U. To ensure that a set of rules will recognize (at least) all
objects from the training set, we have to generate a local reduct for every
object. A simple algorithm checking whether a subset is a local superreduct
works with a time complexity of O(mn): we have to compare our base object with
all other objects and check whether condition a) (see the local reduct
definition) holds. It takes O(mn²) time to do this for all objects (when we are
looking for a local reduct for every object), where n is the number of objects
and m is the number of attributes. This time complexity is not acceptable for
large data tables. A fast approximation algorithm for local reduct generation is
presented in the next section. In sections 3 and 4 some related topics,
concerned with the parallelization of the algorithm and with dealing with
tolerance, are discussed. In section 5 some experimental results are presented.
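
The simple O(mn) check mentioned above can be sketched directly from condition a) of the definition. The table is assumed to be a list of dictionaries (one per object) with a parallel list of decision values; the function names are illustrative. Because adding attributes can never break condition a), minimality can be tested by removing one attribute at a time.

```python
def is_local_superreduct(table, decisions, base, attrs):
    """Condition a): every object with a decision different from the base
    object differs from it on at least one attribute in attrs."""
    return all(
        decisions[j] == decisions[base]
        or any(table[j][a] != table[base][a] for a in attrs)
        for j in range(len(table))
    )


def is_local_reduct(table, decisions, base, attrs):
    """A superreduct from which no single attribute can be removed."""
    attrs = list(attrs)
    if not is_local_superreduct(table, decisions, base, attrs):
        return False
    return not any(
        is_local_superreduct(table, decisions, base, [b for b in attrs if b != a])
        for a in attrs
    )


table = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 1}]
decisions = [0, 1, 1]
```

For the base object 0 in this toy table, {b} is a local reduct, {a} is not even a superreduct, and {a, b} is a superreduct that is not minimal.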

2 Covering algorithm 

An algorithm presented below realizes the following objective: assuming the 
information system is consistent, find a family of subsets i^i, i?2v • • Rk such 
that for any object Oi from U at least one Rj is a local reduct (we will say, 
that Rj covers Oi). We will look for possibly small family i^i,. . . Rk^ i.e. we will 
prefer these subsets which cover possibly many objects. We assume, that these 
subsets reflect regularities in data and generate more general rules - it means 
better classification of new samples and less memory required to store rules. 

1. Let σ be a random permutation of attributes.

2. Let R = A and let N_1, ..., N_n be a table of numbers of local reducts found for
each object. Set N_i = 0.

3. Test whether R is a local reduct for any object. If so, increment N_i for these
objects and store rules.

4. Let R = R − a_i, where a_i is the first attribute from R. Calculate the number M_i
of those objects for which R is a (super)reduct, and which are not covered
by reducts found previously. Let R = R ∪ a_i.

5. Continue step 4 with the next attribute from R. Finish after collecting
numbers M_i for all attributes.

6. Find the maximal number among M_i; if there is more than one such
number, take the first one with respect to the permutation σ. Let a_j be the
attribute associated with this maximum. Let R = R − a_j.

7. Continue from 3 until R is empty.

8. If there is at least one uncovered object, let R = A and continue from 4.
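
The steps above can be sketched in Python as follows. This is only an illustrative reading of the procedure, not the authors' implementation: the names `superreduct_objects` and `covering` are hypothetical, objects are assumed to be tuples of attribute values, and grouping is done by hashing rather than sorting. Two further assumptions are made: R is taken to be a local reduct for an object when R is a local superreduct for it and no R − a_i is, and a cycle is stopped early once all M_i are 0.

```python
import random
from collections import defaultdict


def superreduct_objects(rows, decisions, attrs):
    """Indices of objects for which `attrs` is a local superreduct, i.e. all
    objects sharing their values on `attrs` also share their decision."""
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[tuple(row[a] for a in attrs)].append(i)
    good = set()
    for members in groups.values():
        if len({decisions[i] for i in members}) == 1:
            good.update(members)
    return good


def covering(rows, decisions, seed=0):
    """Cover every object with at least one local reduct (steps 1-8)."""
    n, m = len(rows), len(rows[0])
    all_attrs = list(range(m))
    if len(superreduct_objects(rows, decisions, all_attrs)) < n:
        raise ValueError("information system is not consistent")
    sigma = all_attrs[:]
    random.Random(seed).shuffle(sigma)        # step 1: random permutation
    counts = [0] * n                          # step 2: N_1, ..., N_n
    rules = set()                             # pairs (reduct, object index)
    while min(counts) == 0:                   # step 8: start a new cycle
        R = sigma[:]                          # R = A, kept in permutation order
        while True:
            sup_R = superreduct_objects(rows, decisions, R)
            sup_minus = {
                a: superreduct_objects(rows, decisions, [b for b in R if b != a])
                for a in R
            }
            shrinkable = set().union(*sup_minus.values())
            for i in sup_R - shrinkable:      # step 3: R is a local reduct for i
                rule = (tuple(sorted(R)), i)
                if rule not in rules:
                    rules.add(rule)
                    counts[i] += 1
            if not R:
                break
            # steps 4-5: M_a counts still-uncovered objects that R - a handles
            M = {a: sum(1 for i in sup_minus[a] if counts[i] == 0) for a in R}
            best = max(R, key=M.get)          # step 6: first maximum w.r.t. sigma
            if M[best] == 0:                  # assumption: nothing left to gain
                break
            R.remove(best)                    # step 7: continue with smaller R
    return counts, rules


rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
decisions = [0, 0, 1, 1]                      # the decision copies attribute 0
counts, rules = covering(rows, decisions)
print(counts, sorted(rules))
```

In this small table attribute 0 alone determines the decision, so every object ends up covered by the single-attribute reduct (0,).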

Lemma. The algorithm described above generates a covering for all objects
in at most n = |U| cycles (by "cycle" we mean one sequence of steps from 2 to
8).

Proof. We will prove that in one cycle at least one uncovered object is
covered by a newly produced reduct. When we find out in step 6 of the algorithm
that a set R is a local superreduct for a number of objects not covered so far,
there are two possibilities: a) all M_i are equal to 0, which means that R is a local
reduct for all these objects (because it is a superreduct and none of its subsets is
a superreduct), so we have covered some new objects; b) there exists an M_i > 0,
i.e. at least one subset of R is a superreduct for M_i objects not covered so far, and
we continue from step 4. If our subset R has two attributes, possibility b) means
that there exists a local reduct with one attribute (a superreduct with only one
attribute must be a reduct). So, in one cycle (starting from R = A, which is a
local superreduct for all objects) we either realize possibility a) or, in the worst
case, achieve a reduct containing only one attribute.



We need to have a method of determining whether a subset is a superreduct 
to complete our algorithm. 

1. Sort a table of objects using attribute values (for attributes belonging to R). 

2. Scan the table of sorted objects one by one. Our objects are divided into
groups with equal values on attributes (abstract classes of the indiscernibility
relation generated by R, see [8]).

3. If a group has a uniform value of decision, then R is a local
superreduct for objects belonging to this group. If not, R is not a local
superreduct for these objects.

Since we may use a fast method of sorting, our algorithm has the complexity
O(mn log n), where n = number of objects, m = number of attributes.
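
The sort-and-scan test can be sketched as below. This is a minimal illustration under assumptions not stated in the text: the function name `local_superreduct_flags` is hypothetical, objects are tuples of mutually comparable attribute values, and Python's built-in sort stands in for the "fast method of sorting".

```python
from itertools import groupby


def local_superreduct_flags(rows, decisions, attrs):
    """For each object, tell whether `attrs` is a local superreduct for it."""
    def key(i):
        return tuple(rows[i][a] for a in attrs)

    order = sorted(range(len(rows)), key=key)      # step 1: sort on R
    flags = [False] * len(rows)
    for _, group in groupby(order, key=key):       # step 2: scan equal-value groups
        members = list(group)
        uniform = len({decisions[i] for i in members}) == 1   # step 3
        for i in members:
            flags[i] = uniform
    return flags


rows = [(0, 0), (0, 1), (1, 0)]
decisions = [0, 1, 1]
print(local_superreduct_flags(rows, decisions, [0]))      # attribute 0 only
print(local_superreduct_flags(rows, decisions, [0, 1]))   # both attributes
```

Each key comparison touches at most m attribute values, which is where the factor m in the stated complexity comes from.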

3 Parallel algorithm and practical notes 

The algorithm described in the previous section covers all objects by at least 
one reduct. On the other hand, the more reducts for each object we find, the 
more rules we can generate. Since the algorithm is deterministic for a given
permutation σ, we have the following possibilities:

1. We may choose a set of p permutations σ_1, ..., σ_p and generate p coverings
using this algorithm in parallel on p machines. When the permutations are
different, the obtained coverings are usually different too.

2. We may do the same on one machine sequentially. In this case we can
perform an additional optimization: at each stage of the algorithm we look for
a covering for only those objects which were covered to a minimal degree during
the previous stages.

3. We may use an evolutionary algorithm to find the best permutation, i.e.
the permutation generating a covering using a minimal set of possibly short
reducts. An order-based genetic algorithm (see [3], [10]) can be used in the case
of sequential as well as parallel computations.

When we check whether a subset is a local superreduct for any object, we
can easily check whether it is a global superreduct (R is a global superreduct
if and only if R is a local superreduct for all objects). On the other hand, the algorithm
for finding global reducts described in [4] and [11] uses a structure called "reduct
store" containing all known global superreducts of the information system. Thus, we
can check whether R is known as a global superreduct before we start to sort
our object set, and we can add R to this structure when we find out that
R is a global superreduct. Moreover, the same structure can be used by many
agents in a parallel implementation (each agent calculates a covering for a different
permutation) and by one specialized agent calculating global reducts.

4 Tolerance 

We use local reducts to generate rules which are more general than those generated
on the basis of general reducts. On the other hand, these rules may still be too
specific, especially when we work on numerical data. One of the ways to manage
this problem is to use a discretization technique (see e.g. [5]). Alternatively, we
can use a tolerance measure, which allows us to treat two different (but close)
values as equal.

The algorithm presented in section 2 can be easily adapted to this new situation,
in case the tolerance relation is transitive. In this case we can sort the set of
objects and divide it into classes of this relation, and then continue with the standard
algorithm. Alternatively, we can initially replace attributes' values with
their representatives (found by e.g. methods of scalar quantization or discretization).

Unfortunately, many tolerance relations are not transitive, and we cannot 
simply sort data and check adjacent pairs of objects. More research is needed to 
use our algorithm in this case. 
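
As an illustration of the two situations described above, the sketch below replaces numeric values by bin indices (one simple form of scalar quantization, chosen here only as an example) and contrasts this with a threshold-based tolerance. The names `quantize` and `close`, the bin width and the threshold are all hypothetical.

```python
import math


def quantize(values, width):
    """Replace each numeric value by the index of its equal-width bin."""
    return [math.floor(v / width) for v in values]


def close(a, b, eps):
    """A threshold tolerance: values are 'equal' when they differ by at most eps."""
    return abs(a - b) <= eps


# Sharing a bin is an equivalence relation, so quantized values can be sorted
# and grouped exactly as in the standard algorithm.
print(quantize([1.0, 1.2, 1.6, 2.4], 0.5))

# A threshold tolerance is in general not transitive.
print(close(1.0, 1.4, 0.5), close(1.4, 1.8, 0.5), close(1.0, 1.8, 0.5))
```

The second line of output shows a chain in which neighbouring values are tolerant but the end points are not, which is why sorting and checking adjacent objects is not enough in the non-transitive case.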

5 Experimental results 

The algorithm described in section 2. was implemented and tested on several 
information systems used in real applications - results are shown in the table 
presented below. 



Size: obj × attr    #red    Time [sec]
4,492 × 36            1          13
                     10          49
24,000 × 10           1          25
22,000 × 27           1          90
                      5        1600
47,000 × 28           1         360

Calculations were performed on a Pentium-200 machine. The first data set is
the "Satellite image" database, the second is the "Shuttle" database. The column
"#red" indicates how many reducts (at least) we found for each object.

The results show that our new method is relatively fast, even for large data
tables (finding local reducts for each object using the previous methods takes
many hours for tables with more than 20,000 objects), especially
when we are interested in just a covering of objects. Actually, when we cover
objects by at least one reduct, the average number of reducts covering an object
is usually about 3.5.
6 Conclusions and future work

We have presented a covering approach to the rule generation problem and an
efficient algorithm for finding local reducts for a set of objects. The computation
time for this algorithm is close to the time of global reduct finding ([4]). Our
new method is fast, and it should generate more general rules; a comparison of the
efficiency of the rules generated in the classical and in the new way will be performed in
the future. Another direction of future research is to implement and test an
evolutionary algorithm (see section 3) and tolerance-based techniques. Moreover,
a parallel system has not been implemented so far.
[ Abstract.] In this paper we give a syntactical answer to the
following question: What do we actually know about a partial 
algebra when we know its set of weak or relative subalgebras 
with cardinal smaller than a fixed bound, if we do not have any 
information on how they are linked to each other within the 
algebra? 



1 Introduction 

The “roughness” of a theory consists essentially in the fact that, under some data 
system, objects can only be described approximately. Thus, different objects may 
be indistinguishable from each other by the means available in the system. This 
situation is made precise by the notion of indiscernibility. 

This paper is intended to pursue the idea of structural approximation for 
algebras. Given a weak or relative subalgebra of an algebra (total or partial), 
it can be seen as an approximation of the structure of the latter. This leads to 
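the following notion of indiscernibility. Let $\mathcal{K}$ be a class of algebras of a given
signature. We define $S_w(\mathcal{K})$ as the class of all weak subalgebras of algebras in
$\mathcal{K}$. Then for any class $\mathcal{M}$ of algebras we say that two algebras $\mathbf{A}$ and $\mathbf{B}$ in $\mathcal{K}$
are $\mathcal{M}$-indiscernible iff, for every $\mathbf{D} \in \mathcal{M}$, $\mathbf{D} \in S_w(\mathbf{A})$ if and only if $\mathbf{D} \in S_w(\mathbf{B})$.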
Clearly, a similar notion can be defined for relative subalgebras, too. Thus two 
algebras are indiscernible when they cannot be distinguished from each other by 
the available approximations. 

In this paper we are concerned with approximations of an algebra A given 
by finite weak or relative subalgebras (or even by subalgebras of at most some 
fixed finite cardinality). Notice that such finite approximations are ubiquitous in 
A (each finite subset of the carrier of A supports at least one weak subalgebra 
and exactly one relative subalgebra of A), and they determine completely A, 
provided we know the way they glue (a partial algebra is the direct limit of 

This work has been partially supported by the KBN grant 8-T11C-01011, and the
DGCIyT grant PB96-0191-C02-02
its directed system of finite weak, or finite relative, subalgebras [2, Cor. 4.4.7]).
But what do we actually know about a partial algebra when we know its finite 
approximations (up to isomorphisms), if no information on how they are linked 
to each other within the algebra is available? 

We give here a syntactical answer to this question. We define the syntactical 
content of a type of finite approximations as, roughly, a set of formulas of the 
form “conjunction implies disjunction” that are ‘captured’ by these approxima- 
tions (see Def. 2 below) , and we determine it for weak and relative subalgebras 
(with a fixed bound on the cardinality of their carrier) of partial and total alge- 
bras. 

Notice that partial algebras are often used as models of objects appearing in 
soft computing, such as graphs, relational systems and data bases. For instance, 
a binary relation on a set can be understood as a partial binary operation given 
by the first projection defined only on the pairs in the relation. Our results 
allow us to characterize syntactically, for instance, the knowledge of all its weak 
subsystems (weak subalgebras of that binary algebra) with less than a fixed 
number of elements. Another way of looking at the problem considered in this 
paper is to ask what knowledge on an algebraic structure can be derived from 
finite experiments, which thus brings us close to machine learning. 

This note is born in part from the desire to better understand some of the 
results on similar problems obtained in [1], to be published elsewhere. 

To simplify things, we only deal here with partial algebras over finite (i.e., 
with finitely many operation symbols) and homogeneous (i.e., one-sorted) sig- 
natures, but the results we obtain in this case are easily generalized to more 
general cases; cf. again [1]. 

2 Preliminaries and notations 

For the convenience of the reader, in this section we recall the basic definitions 
on partial algebras, used in this paper (except for weak and relative subalgebras 
defined at the beginning of the next section); for any notion not defined here, as 
well as for more details about those defined, see [2]. In this section we also fix 
some notation and conventions to be used throughout the paper. 

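We fix for the rest of this paper a signature $\Sigma = (\Omega, \eta)$, where $\Omega$ is a
finite set of operation symbols and $\eta : \Omega \to \mathbb{N}$ is the arity mapping. We set
$\Omega^{(n)} = \{\varphi \in \Omega \mid \eta(\varphi) = n\}$ for every $n \in \mathbb{N}$.

A partial $\Sigma$-algebra (an algebra, for short) is a structure $\mathbf{A} = (A, (\varphi^{\mathbf{A}})_{\varphi \in \Omega})$,
where $A$ is a set, called the carrier of the algebra, and for every $\varphi \in \Omega$,
$\varphi^{\mathbf{A}} : A^{\eta(\varphi)} \to A$ is a partial mapping with domain $\mathrm{dom}\,\varphi^{\mathbf{A}} \subseteq A^{\eta(\varphi)}$. We denote the
class of all such algebras by $\mathrm{Alg}_\Sigma$.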

Given an algebra denoted by a capital letter in boldface type ($\mathbf{A}$, $\mathbf{B}$, etc.), we
always denote, unless otherwise stated, its carrier set by the same capital letter
in slanted type ($A$, $B$, etc.).

An algebra is finite when its carrier is finite. The cardinal |A| of a finite 
algebra A is the cardinal of its carrier. 
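An algebra $\mathbf{A}$ is total when $\varphi^{\mathbf{A}}$ is a total mapping, for every $\varphi \in \Omega$; we
denote by $\mathrm{TAlg}_\Sigma$ the class of all total algebras.

Two algebras $\mathbf{A} = (A, (\varphi^{\mathbf{A}})_{\varphi \in \Omega})$ and $\mathbf{B} = (B, (\varphi^{\mathbf{B}})_{\varphi \in \Omega})$ are isomorphic
when there exists a bijection $h : A \to B$ (an isomorphism) such that for
every $\varphi \in \Omega$ and for every $a, a_1, \ldots, a_{\eta(\varphi)} \in A$, $\varphi^{\mathbf{A}}(a_1, \ldots, a_{\eta(\varphi)}) = a$ iff
$\varphi^{\mathbf{B}}(h(a_1), \ldots, h(a_{\eta(\varphi)})) = h(a)$.

We fix henceforth a countably infinite set of variables $X = \{x_i \mid i \in \mathbb{N}\}$,
disjoint from $\Omega$. The set $T_\Sigma(X)$ of ($\Sigma$-)terms with variables in $X$ is defined as
the least set $T$ such that $X \cup \Omega^{(0)} \subseteq T$ and, if $\varphi \in \Omega$ and $t_1, \ldots, t_{\eta(\varphi)} \in T$, then
$\varphi(t_1, \ldots, t_{\eta(\varphi)}) \in T$.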

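Given a term $t \in T_\Sigma(X)$ and an algebra $\mathbf{A}$, we define the (partial) term
function $t^{\mathbf{A}} : A^X \to A$ (where $A^X$ denotes the set of all valuations $v : X \to A$)
as follows:

- If $t = x_i \in X$, then $t^{\mathbf{A}}(v) = v(x_i)$ for every $v : X \to A$.

- If $t = \varphi \in \Omega^{(0)}$, then $t^{\mathbf{A}}(v) = \varphi^{\mathbf{A}}$ for every $v : X \to A$ if $\varphi^{\mathbf{A}}$ is defined^1, and
$\mathrm{dom}\,t^{\mathbf{A}} = \emptyset$ otherwise.

- If $t = \varphi(t_1, \ldots, t_n)$ for some $\varphi \in \Omega^{(n)}$ and terms $t_1, \ldots, t_n$, then $v \in \mathrm{dom}\,t^{\mathbf{A}}$
iff $v \in \bigcap_{i=1}^{n} \mathrm{dom}\,t_i^{\mathbf{A}}$ and $(t_1^{\mathbf{A}}(v), \ldots, t_n^{\mathbf{A}}(v)) \in \mathrm{dom}\,\varphi^{\mathbf{A}}$, and if $v \in \mathrm{dom}\,t^{\mathbf{A}}$
then $t^{\mathbf{A}}(v) = \varphi^{\mathbf{A}}(t_1^{\mathbf{A}}(v), \ldots, t_n^{\mathbf{A}}(v))$.

Notice that the definedness and value of $t^{\mathbf{A}}(v)$ only depend on the images
under $v$ of the variables appearing in $t$. Moreover, if $\varphi \in \Omega^{(n)}$ then the term
function associated to $\varphi(x_1, \ldots, x_n)$ is (essentially) the operation $\varphi^{\mathbf{A}} : A^n \to A$.

To simplify the notation, and unless otherwise stated, when we write in the
sequel $t^{\mathbf{A}}(v)$, we always assume that it is defined, i.e., that $v \in \mathrm{dom}\,t^{\mathbf{A}}$.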
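
The recursive definition of the term function can be illustrated with a small Python sketch. The encoding is an assumption made only for this example: a variable is a string, a compound term is a tuple `(symbol, subterm, ...)`, each partial operation is a dictionary from argument tuples to values, and `None` (assumed not to belong to the carrier) signals that the valuation lies outside the domain of the term function. The name `eval_term` is hypothetical.

```python
def eval_term(term, ops, valuation):
    """Return the value of the term under the valuation, or None when the
    valuation is outside the domain of the term function."""
    if isinstance(term, str):                  # t = x_i
        return valuation[term]
    symbol, *subterms = term                   # t = phi(t_1, ..., t_n), n >= 0
    args = []
    for sub in subterms:
        value = eval_term(sub, ops, valuation)
        if value is None:                      # some t_i is undefined at v
            return None
        args.append(value)
    return ops[symbol].get(tuple(args))        # None if args are outside dom phi


# A partial successor on the carrier {0, 1, 2}: s(2) is undefined.
ops = {"s": {(0,): 1, (1,): 2}, "c": {}}       # "c" is an undefined constant
t = ("s", ("s", "x1"))                         # the term s(s(x1))
print(eval_term(t, ops, {"x1": 0}))
print(eval_term(t, ops, {"x1": 1}))
print(eval_term(("c",), ops, {}))
```

Only the values assigned to variables occurring in the term are consulted, in line with the remark that definedness and value depend only on those variables.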

An existence equation, an equation for short, is a pair $(p, q) \in T_\Sigma(X)^2$ of
terms, and will be written $p \approx q$ in the sequel.

Given an algebra $\mathbf{A}$ and a valuation $v : X \to A$, the equation $p \approx q$ is
satisfied in $\mathbf{A}$ w.r.t. $v$, in symbols $(\mathbf{A}, v) \models p \approx q$, when $v \in \mathrm{dom}\,p^{\mathbf{A}} \cap \mathrm{dom}\,q^{\mathbf{A}}$
and $p^{\mathbf{A}}(v) = q^{\mathbf{A}}(v)$. So, for instance, $(\mathbf{A}, v) \models p \approx p$ means simply that $p^{\mathbf{A}}(v)$
is defined; therefore, we denote the equation $p \approx p$ by $\exists p$.

Using equations as atoms, and the connectives $\neg, \vee, \wedge, \Rightarrow, \ldots$ (with their usual
logical meaning), we can build up formulas and define their satisfaction in a
partial algebra w.r.t. a given valuation; see [2, §7.1] for details. In this paper we
are only interested in a special type of such formulas.

A quasi-existence equation of type $\Sigma$ is a formula of the form
$(\bigwedge_{i \in I} p_i \approx q_i) \Rightarrow p \approx q$ with $I$ a finite set. A disjunctive quasi-existence equation, a $\vee$-equation
for short, of type $\Sigma$ is a formula of the form

$$\Big(\bigwedge_{i \in I} p_i \approx q_i\Big) \Rightarrow \Big(\bigvee_{j \in J} p'_j \approx q'_j\Big)$$

with $I$ and $J$ finite sets; so, $\vee$-equations include, as special cases, quasi-existence
equations and disjunctions of equations (taking $|J| = 1$ and $I = \emptyset$, respectively).

^1 If $\varphi \in \Omega^{(0)}$, we say that $\varphi^{\mathbf{A}}$ is defined when $\varphi^{\mathbf{A}} : A^0 \to A$ is total, and then we use
the same symbol $\varphi^{\mathbf{A}}$ to denote the image of this mapping.

To simplify the notation, we usually omit the brackets embracing the premise
and the conclusion in $\vee$-equations.

We denote by $\mathcal{L}$ the set of all $\vee$-equations of some previously fixed type.

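Let now $\Phi = \bigwedge_{i \in I} p_i \approx q_i \Rightarrow \bigvee_{j \in J} p'_j \approx q'_j$ be a $\vee$-equation. Then $\Phi$ is
satisfied in an algebra $\mathbf{A}$ w.r.t. a valuation $v : X \to A$ (in symbols $(\mathbf{A}, v) \models \Phi$)
iff the following condition holds:

If $(\mathbf{A}, v) \models p_i \approx q_i$ for every $i \in I$, then $(\mathbf{A}, v) \models p'_j \approx q'_j$ for some
$j \in J$.

So, $(\mathbf{A}, v) \not\models \Phi$ iff $p_i^{\mathbf{A}}(v) = q_i^{\mathbf{A}}(v)$ for every $i \in I$ but, for every $j \in J$, either
$v \notin \mathrm{dom}\,{p'_j}^{\mathbf{A}}$, or $v \notin \mathrm{dom}\,{q'_j}^{\mathbf{A}}$, or ${p'_j}^{\mathbf{A}}(v) \neq {q'_j}^{\mathbf{A}}(v)$.

Now, an algebra $\mathbf{A}$ (globally) satisfies a $\vee$-equation $\Phi$, in symbols $\mathbf{A} \models \Phi$,
when $(\mathbf{A}, v) \models \Phi$ for every $v : X \to A$.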
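
For a finite algebra, global satisfaction of a disjunctive quasi-existence equation can be checked by brute force over the valuations of the variables that actually occur. The sketch below assumes a hypothetical encoding: variables are strings, compound terms are tuples `(symbol, subterm, ...)`, operations are dictionaries from argument tuples to values, an equation is a pair of terms, and `None` (assumed not in the carrier) stands for "undefined". All function names are illustrative.

```python
from itertools import product


def eval_term(term, ops, v):
    """Value of a term under valuation v, or None when undefined."""
    if isinstance(term, str):
        return v[term]
    symbol, *subterms = term
    args = [eval_term(s, ops, v) for s in subterms]
    if any(a is None for a in args):
        return None
    return ops[symbol].get(tuple(args))


def variables(term):
    """Set of variables occurring in a term."""
    if isinstance(term, str):
        return {term}
    return set().union(*(variables(s) for s in term[1:]))


def holds(equation, ops, v):
    """Existence equation: both sides defined and equal."""
    p, q = equation
    a, b = eval_term(p, ops, v), eval_term(q, ops, v)
    return a is not None and a == b


def satisfies(carrier, ops, premises, conclusions):
    """True iff every valuation satisfying all premises satisfies some conclusion."""
    equations = premises + conclusions
    names = sorted(set().union(*(variables(t) for eq in equations for t in eq)))
    for values in product(carrier, repeat=len(names)):
        v = dict(zip(names, values))
        if all(holds(e, ops, v) for e in premises) and not any(
            holds(e, ops, v) for e in conclusions
        ):
            return False
    return True


carrier = [0, 1, 2]
ops = {"s": {(0,): 1, (1,): 2}}                 # partial successor, s(2) undefined
sx1, sx2 = ("s", "x1"), ("s", "x2")
print(satisfies(carrier, ops, [], [(sx1, sx1)]))            # "s(x1) exists"
print(satisfies(carrier, ops, [(sx1, sx2)], [("x1", "x2")]))  # s is injective
```

The first formula fails because s is undefined at 2; the second holds because the premise already forces both sides to be defined.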

It is clear that two isomorphic algebras satisfy exactly the same $\vee$-equations
(as we will see later, the converse implication is false, even for total algebras).

We say that an equation $p \approx q$ is a consequence of a finite set of equations
$\{p_i \approx q_i\}_{i \in I}$ when the quasi-equation $\bigwedge_{i \in I} p_i \approx q_i \Rightarrow p \approx q$ is a tautology (i.e.,
it is satisfied by all algebras); it is equivalent to say that $p \approx q$ is deduced from
$\{p_i \approx q_i\}_{i \in I}$ through Burmeister's deduction rules for existence equations [2,
§6.4.8].

3 Main results 

We begin by recalling the definitions of weak and relative subalgebras. 

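[ Definition 1.] Let $\mathbf{A} = (A, (\varphi^{\mathbf{A}})_{\varphi \in \Omega})$ and $\mathbf{B} = (B, (\varphi^{\mathbf{B}})_{\varphi \in \Omega})$ be two algebras,
with $B \subseteq A$.

i) $\mathbf{B}$ is a weak subalgebra of $\mathbf{A}$ when, for every $\varphi \in \Omega$, if $\underline{b} \in \mathrm{dom}\,\varphi^{\mathbf{B}}$ then
$\underline{b} \in \mathrm{dom}\,\varphi^{\mathbf{A}}$ and $\varphi^{\mathbf{B}}(\underline{b}) = \varphi^{\mathbf{A}}(\underline{b})$.

ii) $\mathbf{B}$ is a relative subalgebra of $\mathbf{A}$ when it is a weak subalgebra and, for every
$\varphi \in \Omega$, if $\underline{b} \in \mathrm{dom}\,\varphi^{\mathbf{A}} \cap B^{\eta(\varphi)}$ and $\varphi^{\mathbf{A}}(\underline{b}) \in B$ then $\underline{b} \in \mathrm{dom}\,\varphi^{\mathbf{B}}$.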

Notice that every subset $B$ of the carrier of an algebra $\mathbf{A}$ supports (in principle)
many weak subalgebras of $\mathbf{A}$, but only one such relative subalgebra, namely
the greatest possible weak subalgebra of $\mathbf{A}$ supported on $B$.
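
A small sketch may help to see the difference between the two notions. It uses the introduction's example of a binary relation viewed as a partial binary operation (the first projection, defined only on pairs in the relation). The encoding (operations as dictionaries from argument tuples to values) and the function names are assumptions made for this illustration only; the operations of the candidate subalgebra are assumed to take arguments and values inside its carrier.

```python
def is_weak_subalgebra(B, ops_B, A, ops_A):
    """B is contained in A, and every operation of B is a restriction of that of A."""
    if not set(B) <= set(A):
        return False
    return all(
        args in ops_A[symbol] and ops_A[symbol][args] == value
        for symbol, table in ops_B.items()
        for args, value in table.items()
    )


def relative_subalgebra(B, ops_A):
    """Operations of the greatest weak subalgebra of A supported on B."""
    Bset = set(B)
    return {
        symbol: {a: v for a, v in table.items() if set(a) <= Bset and v in Bset}
        for symbol, table in ops_A.items()
    }


def is_relative_subalgebra(B, ops_B, A, ops_A):
    """Weak subalgebra whose operations equal those of the relative subalgebra."""
    return is_weak_subalgebra(B, ops_B, A, ops_A) and all(
        ops_B.get(symbol, {}) == table
        for symbol, table in relative_subalgebra(B, ops_A).items()
    )


# Relation {(1, 2), (2, 3)} on A = {1, 2, 3} as a partial binary operation r.
A = {1, 2, 3}
ops_A = {"r": {(1, 2): 1, (2, 3): 2}}
B = {1, 2}
print(is_weak_subalgebra(B, {"r": {}}, A, ops_A))           # discrete weak subalgebra
print(is_relative_subalgebra(B, {"r": {}}, A, ops_A))       # but not relative
print(relative_subalgebra(B, ops_A))
```

The subset {1, 2} supports two weak subalgebras here (with r empty, or with r defined on (1, 2)), and only the second is the relative one.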

Given an algebra $\mathbf{A}$, let $S_w^{(n)}(\mathbf{A})$ and $S_r^{(n)}(\mathbf{A})$ be the classes of all algebras
of cardinal at most $n$ that are isomorphic to weak and relative subalgebras of
$\mathbf{A}$, respectively. Let also $S_w^{f}(\mathbf{A})$ and $S_r^{f}(\mathbf{A})$ be the classes of all finite algebras
that are isomorphic to weak and relative subalgebras of $\mathbf{A}$, respectively. These
are the finite approximations of $\mathbf{A}$ we consider in this paper.

[ Definition 2.] Let $\mathcal{C}$ be a class of algebras and let $S$ be an algebraic operator
corresponding to some type of finite subalgebras (for instance, one of those
defined above).

A set $\mathcal{E}$ of $\vee$-equations is the syntactical content of $S$ for $\mathcal{C}$ when it is the
greatest subset of $\mathcal{L}$ satisfying the following three conditions:







W. Bartol, X. Caicedo, F. Rossello 



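(i) If $\bigwedge_{i \in I} p_i \approx q_i \Rightarrow \bigvee_{j \in J} p'_j \approx q'_j \in \mathcal{E}$, then $\mathcal{E}$ also contains every
$\vee$-equation of the form $\bigwedge_{i \in I} p_i \approx q_i \Rightarrow \bigvee_{j \in J'} p'_j \approx q'_j$ with $J' \subseteq J$ (we say
then that $\mathcal{E}$ is well-formed).

(ii) For every formula $\Phi \in \mathcal{E}$ there exists a non-empty finite set $C_{S,\mathcal{C}}(\Phi)$ of
finite algebras such that, for every algebra $\mathbf{A} \in \mathcal{C}$,

$\mathbf{A} \not\models \Phi$ iff there exists some $\mathbf{A}_0 \in C_{S,\mathcal{C}}(\Phi) \cap S(\mathbf{A})$.

(iii) For every finite algebra $\mathbf{A}_0$, there exists a formula $\Phi_{S,\mathcal{C}}(\mathbf{A}_0) \in \mathcal{E}$ such
that, for every algebra $\mathbf{A} \in \mathcal{C}$,

$\mathbf{A}_0 \in S(\mathbf{A})$ iff $\mathbf{A} \not\models \Phi_{S,\mathcal{C}}(\mathbf{A}_0)$.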

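Thus, a well-formed set $\mathcal{E}$ of $\vee$-equations is the syntactical content of $S$ for a
class $\mathcal{C}$ when it is the greatest such set such that, for every $\mathbf{A} \in \mathcal{C}$, the knowledge
of $S(\mathbf{A})$ is equivalent to the knowledge of

$$\mathcal{E}(\mathbf{A}) = \{\Phi \in \mathcal{E} \mid \mathbf{A} \models \Phi\}.$$

Indeed, notice that, for every $\mathbf{A} \in \mathcal{C}$:

- To know whether $\mathbf{A}$ satisfies a formula $\Phi \in \mathcal{E}$, one has only to check whether
some algebra in the finite set $C_{S,\mathcal{C}}(\Phi)$ belongs to $S(\mathbf{A})$;

- To know whether a given finite algebra $\mathbf{A}_0$ belongs to $S(\mathbf{A})$, one only has to
check the non-satisfaction of $\Phi_{S,\mathcal{C}}(\mathbf{A}_0)$ by $\mathbf{A}$.

In particular, the following result holds.

[ Proposition 1.] Let $\mathcal{E}$ be the syntactical content of the operator $S$ for a class
$\mathcal{C}$ of algebras. Then, given any two algebras $\mathbf{A}, \mathbf{B} \in \mathcal{C}$, $S(\mathbf{A}) = S(\mathbf{B})$ iff
$\mathcal{E}(\mathbf{A}) = \mathcal{E}(\mathbf{B})$.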

Consider now the following definition. 

[ Definition 3.] Given a non-tautological $\vee$-equation $\Phi$, we shall call its complexity
the greatest cardinal $\kappa(\Phi)$ of an algebra $\mathbf{A}$ such that $\mathbf{A} \not\models \Phi$ but $\mathbf{A}' \models \Phi$
for every strict weak subalgebra $\mathbf{A}'$ of $\mathbf{A}$. We adopt the convention that tautological
$\vee$-equations have complexity 0.

Let $\mathcal{L}^{(n)}$ be the set of all $\vee$-equations of complexity at most $n$, and, for every
$\mathcal{E} \subseteq \mathcal{L}$, set $\mathcal{E}^{(n)} = \mathcal{E} \cap \mathcal{L}^{(n)}$.

Notice that the complexity of a $\vee$-equation $\Phi$ is always smaller than or equal to
the cardinal of the least initial segment^2 of $T_\Sigma(X)$ containing all terms appearing
in it.

^2 A subset $Y \subseteq T_\Sigma(X)$ is an initial segment when, for every $\varphi \in \Omega$ and $t_1, \ldots, t_{\eta(\varphi)} \in T_\Sigma(X)$,
$\varphi(t_1, \ldots, t_{\eta(\varphi)}) \in Y$ implies $t_1, \ldots, t_{\eta(\varphi)} \in Y$.
[ Proposition 2.] Let $\mathcal{L}_w$ denote the set of all $\vee$-equations of the form

$$\bigwedge_{i \in I} p_i \approx q_i \Rightarrow \Big(\bigvee_{(j_1, j_2) \in J} x_{j_1} \approx x_{j_2}\Big) \vee \Big(\bigvee_{k \in K} p'_k \approx q'_k\Big)$$

such that, for every $k \in K$, $\exists p'_k$, $\exists q'_k$ are consequences of $\bigwedge_{i \in I} p_i \approx q_i$.

i) For every $n \in \mathbb{N}$, $\mathcal{L}_w^{(n)}$ is the syntactical content of $S_w^{(n)}$ for $\mathrm{Alg}_\Sigma$.
ii) $\mathcal{L}_w$ is the syntactical content of $S_w^{f}$ for $\mathrm{Alg}_\Sigma$.

[ Proof.] We will prove only (i), since (ii) follows immediately. To do that, $\mathcal{L}_w^{(n)}$
being clearly well-formed, we check points (ii) and (iii) in the definition of syntactical
content, and then we show that $\mathcal{L}_w^{(n)}$ is the greatest well-formed set of
$\vee$-equations satisfying point (ii) therein.

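a) For every $\Phi = \bigwedge_{i \in I} p_i \approx q_i \Rightarrow (\bigvee_{(j_1, j_2) \in J} x_{j_1} \approx x_{j_2}) \vee (\bigvee_{k \in K} p'_k \approx q'_k)$
in $\mathcal{L}_w^{(n)}$, let $C_{S_w^{(n)}}(\Phi)$ be a (minimal) set containing one, and only one,
representative of every isomorphism class of algebras $\mathbf{A}'$ such that $\mathbf{A}' \not\models \Phi$ and
$|\mathbf{A}'| \leq \kappa(\Phi) \leq n$. This set is clearly finite (and it is empty iff $\Phi$ is a tautology).
We will show that it satisfies the property required in Definition 2.

Let $\mathbf{A}$ be an algebra such that $\mathbf{A} \not\models \Phi$, and let $v : X \to A$ be a valuation such
that $(\mathbf{A}, v) \not\models \Phi$, i.e., such that $p_i^{\mathbf{A}}(v) = q_i^{\mathbf{A}}(v)$ for every $i \in I$ but ${p'_k}^{\mathbf{A}}(v) \neq {q'_k}^{\mathbf{A}}(v)$
for every $k \in K$ (notice that all ${p'_k}^{\mathbf{A}}(v)$ and ${q'_k}^{\mathbf{A}}(v)$ are defined because
$\exists p'_k$ and $\exists q'_k$ are consequences of $\bigwedge_{i \in I} p_i \approx q_i$) and $v(x_{j_1}) \neq v(x_{j_2})$ for every
$(j_1, j_2) \in J$ with $j_1 \neq j_2$.

Let $V$ be the set of all variables appearing (explicitly) in the terms of $\Phi$,
and let $\mathbf{A}'$ be the least finite weak subalgebra of $\mathbf{A}$ containing $v(V)$ and such
that $p_i^{\mathbf{A}'}(v)$ and $q_i^{\mathbf{A}'}(v)$ are defined for every $i \in I$. Then, for any valuation
$v' : X \to A'$ that coincides with $v$ on $V$ we have $(\mathbf{A}', v') \not\models \Phi$; hence $\mathbf{A}' \not\models \Phi$.
Since any strict weak subalgebra of $\mathbf{A}'$ satisfies $\Phi$, we have that $|\mathbf{A}'| \leq \kappa(\Phi)$
and thus it has an isomorphic copy $\mathbf{A}_0$ in $C_{S_w^{(n)}}(\Phi)$. This shows that if
$\mathbf{A} \not\models \Phi$ then there exists some $\mathbf{A}_0$ in $C_{S_w^{(n)}}(\Phi) \cap S_w^{(n)}(\mathbf{A})$.

Conversely, let $\mathbf{A}'$ be a finite algebra of cardinality at most $\kappa(\Phi)$ such that
$\mathbf{A}' \not\models \Phi$ and let $\mathbf{A}$ be an algebra containing $\mathbf{A}'$ as a weak subalgebra. Let
$v : X \to A'$ be a valuation such that $(\mathbf{A}', v) \not\models \Phi$: then $p_i^{\mathbf{A}'}(v) = q_i^{\mathbf{A}'}(v)$ for
every $i \in I$ but ${p'_k}^{\mathbf{A}'}(v) \neq {q'_k}^{\mathbf{A}'}(v)$ for every $k \in K$ (remember that $\exists p'_k$ and
$\exists q'_k$ are consequences of $\bigwedge_{i \in I} p_i \approx q_i$) and $v(x_{j_1}) \neq v(x_{j_2})$ for every $(j_1, j_2) \in J$
with $j_1 \neq j_2$.

Taking as $v : X \to A$ the same valuation with target set $A$, we also have
$p_i^{\mathbf{A}}(v) = q_i^{\mathbf{A}}(v)$ for every $i \in I$ (because they are already defined, and equal,
in $\mathbf{A}'$), ${p'_k}^{\mathbf{A}}(v) \neq {q'_k}^{\mathbf{A}}(v)$ for every $k \in K$ (because they are already defined,
and different, in $\mathbf{A}'$), and $v(x_{j_1}) \neq v(x_{j_2})$ for every $(j_1, j_2) \in J$ with $j_1 \neq j_2$, so
$(\mathbf{A}, v) \not\models \Phi$ and consequently $\mathbf{A} \not\models \Phi$.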

b) Every partial algebra has an empty weak subalgebra; thus, we can take as
Fs(n) (0) the equation xi ^ xi. 
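Assume now that $\mathbf{A}_0$ is a non-empty partial algebra of cardinality $m \geq 1$,
with carrier $A_0 = \{a_1, \ldots, a_m\}$. Let $I(\mathbf{A}_0)$ be the set of equations

$$I(\mathbf{A}_0) = \{\varphi(x_{i_1}, \ldots, x_{i_n}) \approx x_{i_0} \mid \varphi \in \Omega^{(n)},\ n \geq 0,\ i_0, \ldots, i_n \in \{1, \ldots, m\},\ \varphi^{\mathbf{A}_0}(a_{i_1}, \ldots, a_{i_n}) = a_{i_0}\}$$

and take as $\Phi_{S_w^{(n)},\mathrm{Alg}_\Sigma}(\mathbf{A}_0)$ the $\vee$-equation

$$\bigwedge I(\mathbf{A}_0) \Rightarrow \bigvee_{1 \leq j_1 < j_2 \leq m} x_{j_1} \approx x_{j_2}.$$

Notice that $(\mathbf{A}_0, v) \not\models \Phi_{S_w^{(n)},\mathrm{Alg}_\Sigma}(\mathbf{A}_0)$ for any valuation $v : X \to A_0$ such that
$v(x_i) = a_i$, $i = 1, \ldots, m$. Therefore, if $\mathbf{A}_0 \in S_w^{(n)}(\mathbf{A})$, then $\mathbf{A} \not\models \Phi_{S_w^{(n)},\mathrm{Alg}_\Sigma}(\mathbf{A}_0)$.
Conversely, assume that $\mathbf{A} \not\models \Phi_{S_w^{(n)},\mathrm{Alg}_\Sigma}(\mathbf{A}_0)$, and let $v : X \to A$ be a
valuation such that $(\mathbf{A}, v) \not\models \Phi_{S_w^{(n)},\mathrm{Alg}_\Sigma}(\mathbf{A}_0)$. Taking $a'_i = v(x_i)$ for every $i = 1, \ldots, m$,
we have that

- if $\varphi(x_{i_1}, \ldots, x_{i_n}) \approx x_{i_0} \in I(\mathbf{A}_0)$, i.e., if $\varphi^{\mathbf{A}_0}(a_{i_1}, \ldots, a_{i_n}) = a_{i_0}$, then
$\varphi^{\mathbf{A}}(a'_{i_1}, \ldots, a'_{i_n}) = a'_{i_0}$;

- if $1 \leq j_1 < j_2 \leq m$ then $a'_{j_1} \neq a'_{j_2}$.

Let $A'_0 = \{a'_1, \ldots, a'_m\}$, and consider the weak subalgebra $\mathbf{A}'_0 = (A'_0, (\varphi^{\mathbf{A}'_0})_{\varphi \in \Omega})$
of $\mathbf{A}$ with

$$\mathrm{dom}\,\varphi^{\mathbf{A}'_0} = \{(a'_{i_1}, \ldots, a'_{i_n}) \mid (a_{i_1}, \ldots, a_{i_n}) \in \mathrm{dom}\,\varphi^{\mathbf{A}_0}\}, \quad \varphi \in \Omega^{(n)},\ n \geq 0.$$

Then the mapping $h : A_0 \to A'_0$ defined by $h(a_i) = a'_i$, $i = 1, \ldots, m$, is an
isomorphism of $\mathbf{A}_0$ onto $\mathbf{A}'_0$, and therefore $\mathbf{A}_0 \in S_w^{(n)}(\mathbf{A})$.

Notice that the complexity of $\Phi_{S_w^{(n)},\mathrm{Alg}_\Sigma}(\mathbf{A}_0)$ is exactly $|\mathbf{A}_0| = m$. Therefore,
for every algebra $\mathbf{A}_0$ of cardinality at most $n$ we have constructed a $\vee$-equation
$\Phi_{S_w^{(n)},\mathrm{Alg}_\Sigma}(\mathbf{A}_0)$ of complexity at most $n$ such that, for every $\mathbf{A} \in \mathrm{Alg}_\Sigma$, $\mathbf{A}_0 \in S_w^{(n)}(\mathbf{A})$
iff $\mathbf{A} \not\models \Phi_{S_w^{(n)},\mathrm{Alg}_\Sigma}(\mathbf{A}_0)$, as wanted.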

c) Assume that there is some well-formed set of equations $\mathcal{E}$, not contained
in $\mathcal{L}_w^{(n)}$, and satisfying (ii) in Definition 2 w.r.t. the operator $S_w^{(n)}$ and the class
$\mathrm{Alg}_\Sigma$. Then, $\mathcal{E}$ will contain a formula $\Phi$ of the form $\bigwedge_{i \in I} p_i \approx q_i \Rightarrow p \approx q$,
where, say, $\exists p$ is not a consequence of the premise.

Let $\mathbf{A}$ be a minimal algebra not satisfying $\bigwedge_{i \in I} p_i \approx q_i \Rightarrow \exists p$, and
therefore $\Phi$ either. Then by (ii) $\mathbf{A}$ has a finite weak subalgebra $\mathbf{A}_0$ that belongs
to $C_{S_w^{(n)}}(\Phi)$; however, $\mathbf{A}$ and thus also $\mathbf{A}_0$ has an extension satisfying $\Phi$, in
contradiction with (ii). □

In the proofs of the next propositions, we shall only give the corresponding
sets $C_{S,\mathcal{C}}(\Phi)$ and $\vee$-equations $\Phi_{S,\mathcal{C}}(\mathbf{A}_0)$; the proofs of the desired properties are
similar to those in the previous proposition, already presented in detail.
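[ Proposition 3.] a) $\mathcal{L}^{(n)}$ is the syntactical content of $S_w^{(n)}$ for $\mathrm{TAlg}_\Sigma$.
b) $\mathcal{L}$ is the syntactical content of $S_w^{f}$ for $\mathrm{TAlg}_\Sigma$.

[ Proof.] Given an arbitrary $\vee$-equation $\Phi = \bigwedge_{i \in I} p_i \approx q_i \Rightarrow \bigvee_{j \in J} p'_j \approx q'_j$
in $\mathcal{L}^{(n)}$, let $C_{S_w^{(n)},\mathrm{TAlg}_\Sigma}(\Phi)$ be a (minimal) set containing one, and only one,
representative of every isomorphism class of algebras $\mathbf{A}'$ of cardinality at most
$\kappa(\Phi)$ that do not satisfy $\Phi$ for some valuation $v : X \to A'$.

Moreover, given a finite algebra $\mathbf{A}_0$, let $\Phi_{S_w^{(n)},\mathrm{TAlg}_\Sigma}(\mathbf{A}_0) = \Phi_{S_w^{(n)},\mathrm{Alg}_\Sigma}(\mathbf{A}_0)$. □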

[ Proposition 4.] Let $\mathcal{L}_r$ denote the set of all $\vee$-equations of the form

/\ Pi « qi ^ ( V C ~ V ( Y p'fe « c4) 

iei jeJ keK 

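where every $t_j$ is either a variable or a term of the form $\varphi(x_{j_1}, \ldots, x_{j_n})$ for
some $\varphi \in \Omega^{(n)}$, and, for every $k \in K$, $\exists p'_k$, $\exists q'_k$ are consequences of $\bigwedge_{i \in I} p_i \approx q_i$.

a) For every $n \in \mathbb{N}$, $\mathcal{L}_r^{(n)}$ (resp. $\mathcal{L}^{(n)}$) is the syntactical content of $S_r^{(n)}$ for
$\mathrm{Alg}_\Sigma$ (resp. $\mathrm{TAlg}_\Sigma$).

b) $\mathcal{L}_r$ (resp. $\mathcal{L}$) is the syntactical content of $S_r^{f}$ for $\mathrm{Alg}_\Sigma$ (resp. $\mathrm{TAlg}_\Sigma$).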

[ Proo/.] For every <!> e take (P) = (<l>) and (P) 

= G^(n) {F). Moreover, for every finite algebra Aq: 

- = ‘^S<">,TAlgd®) = 

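- If $\mathbf{A}_0$ has carrier $A_0 = \{a_1, \ldots, a_m\}$, $m \geq 1$, then let $I(\mathbf{A}_0)$ be the set of
equations associated to $\mathbf{A}_0$ as in the proof of Proposition 2, and let

$$I'(\mathbf{A}_0) = \{\varphi(x_{i_1}, \ldots, x_{i_n}) \approx x_{i_0} \mid \varphi \in \Omega^{(n)},\ n \geq 0,\ i_0, \ldots, i_n \in \{1, \ldots, m\},\ (a_{i_1}, \ldots, a_{i_n}) \notin \mathrm{dom}\,\varphi^{\mathbf{A}_0} \text{ or } \varphi^{\mathbf{A}_0}(a_{i_1}, \ldots, a_{i_n}) \neq a_{i_0}\}.$$

Then take $\Phi_{S_r^{(n)},\mathrm{Alg}_\Sigma}(\mathbf{A}_0) = \Phi_{S_r^{(n)},\mathrm{TAlg}_\Sigma}(\mathbf{A}_0)$ to be the $\vee$-equation

$$\bigwedge I(\mathbf{A}_0) \Rightarrow \Big(\bigvee I'(\mathbf{A}_0)\Big) \vee \bigvee_{1 \leq j_1 < j_2 \leq m} x_{j_1} \approx x_{j_2}.$$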
Abstract. In this paper, we focus our attention on the representation of linguistic
negation of nuanced information in knowledge-based systems. The linguistic negation 
presented here uses a compatibility level and tolerance threshold based upon neighbor- 
hood and similarity relations. Their combination allows us to choose the reference frame 
from which the possible values of a linguistic negation of A appearing in the statement 
is not A will be extracted. This new approach to negation takes into account linguis- 
tic analysis of negation and leads to intended properties in common sense reasoning. 



1 Introduction 

In this paper, we focus our attention on the representation of nuanced infor- 
mation expressed in affirmative form like “the weather is really very wet” or in 
negative form like “the weather is not dry” . Our main goal has been to create a 
new symbolic model dealing with this information, but more particularly with 
negative information referring to linguistic negation, and this, within the context 
of fuzzy set theory [27]. It is obvious that it is not easy to solve the problem 
of representation of nuances ([18,20,19]), modifiers ([28], [2,4], [3], [5], [7], [16], 
[13]), or linguistic labels ([25]) in terms of membership functions. In Section 2, we 
present the initial representation of nuanced information based on an automatic 
process defining the L-R functions ([10]) associated to nuances of basic properties 
([8,9]). Section 3 is devoted to the presentation of a new approach to linguistic 
negation which, (1) alleviates difficulties of logical negation based on fuzzy com- 
plementation ([27]) and of qualitative negation [25,16] used in symbolic models, 
and (2) improves the approach to linguistic negation proposed by Pacholczyk 
in ([18,17,20]). We first define a more general concept of linguistic negation de- 
pending on compatibility level p and tolerance threshold s'. These parameters 
define linguistic negation through neighbourhood and similarity relations. Then, 
a combination of p and s' values allows us to choose the reference frame, a set 
of nuanced properties, denoted as Neg^ ^(A), from which the possible values of 
linguistic negation of A will be extracted. Each element of Neg^ ^(A) is said to 
be a linguistic negation p- compatible with A with a tolerance threshold e. Then, 
we can define a reference frame subset, denoted as neg^ ^(A, x), which consists of 
the intended meanings of the linguistic negation p- compatible with A at x with a 
tolerance threshold e. This new linguistic negation leads to the intended meaning 
resulting from the analysis proposed by linguists. Finally, we present properties 
of this new linguistic negation. 
2 The Initial Frame of Information Representation



In many domains a part of knowledge is represented by facts, denoted as “x is
A”, and rules, denoted as if “x is A” then “y is B”. So, their representation can
be handled with Fuzzy Sets associated with respective properties. Moreover, this
knowledge based upon fuzzy properties, (1) can be expressed in natural language
with the aid of nuanced expressions, and (2) can refer to linguistic negations
of properties. The knowledge base can be characterized by a finite number of
distinct concepts $C_i$. The set X denotes the objects (or individuals) belonging to
the discourse universe. A finite set of distinct basic properties denoted as $P_{ik}$,
defined in the same domain $D_i$, is associated with each concept $C_i$. The user
applies linguistic modifiers to these basic properties to express his knowledge.

- Example. A knowledge base can contain rules like R1: if Jack is not small then
he is visible in the crowd, R2: if the wage is not high then the summer holidays
are not long, R3: if the weather is not wet then the tourist season is not bad.
The user can introduce facts like F1: Jack is really very tall, F2: the wage is
really low, F3: the weather is dry.

The model proposed by Desmontils & Pacholczyk in [8,9] allows us to refer
to affirmative information like “x is $f_\alpha m_\beta P_{ik}$” or negative information like “x
is not $f_\alpha m_\beta P_{ik}$”. A property like “$f_\alpha m_\beta P_{ik}$” which requires for its expression a
list of linguistic terms is called a nuanced property. Let us denote as $\mathcal{N}$ the set
containing the nuances of all basic properties $P_{ik}$. We have selected two sets of
fuzzy modifiers:

- The first one consists of translation modifiers: $M_7$ = {extremely little, very little,
rather little, moderately (∅), rather, very, extremely} (Fig. 1). $M_7$ is supposed
to be totally ordered by the relation: $m_\alpha \leq m_\beta \Leftrightarrow \alpha \leq \beta$.

- The second one consists of precision modifiers: $F_6$ = {vaguely, neighboring, more
or less, moderately (∅), really, exactly} (Fig. 2). In the same way, $F_6$ is supposed
to be totally ordered: $f_\alpha \leq f_\beta \Leftrightarrow \alpha \leq \beta$.





Fig. 1. Translation modifiers 



Fig. 2. Precision modifiers 



3 Negation $\rho$-compatible with A with Tolerance
Threshold $\varepsilon$

As already pointed out by Pacholczyk in [18,19,20], it is necessary to leave a 
representation of linguistic negation in terms of a one-to-one correspondence be- 
tween the elements of L and to turn towards a one-to-one correspondence between 
an element of L and a subset of L. More generally, any function associating with 
each nuanced property A a subset of nuanced properties defined on the same 
domain as A, will be called a multiset function. 
Basing one’s argument on linguistic analysis (Muller [15], Culioli [6], Horn
[12], Ducrot and Schaeffer [11], Ladusaw [14]), Pacholczyk has pointed out in
[18,20] that when one asserts that “x is not A” then, (1) one rejects a reference
to “x is A”, and (2) if necessary, one refers either to the logical negation of A,
or to another property P different from A but defined in the same domain, or
sometimes to a nuance $f_\alpha m_\beta A$ of A, or finally to a new basic property denoted
as not-A.

The following examples illustrate these different cases: Asserting that “Smith 
is not guilty”, Sherlock Holmes can only reject the hypothesis “Smith is guilty” ; 
Saying that “my hat is not blue” , the clown can signify that his hat has another 
colour ; The statement of an employer “the wage is not very high” can receive as 
an intended meaning “the wage is really low” ; Saying “this wine is not bad” the 
restaurant owner can signify to his customer that “it is not-bad ”. So, “not-bad” 
is a new basic property associated with wine quality. 

The statement “x is A” may be interpreted as one of the following statements
([24]): “x is $\emptyset$A”, “x is really A” or “x is more or less A”. In the set $F_6$ of
precision modifiers, we can put: $F_{61}$ = {more or less, $\emptyset$, really} and $F_{62}$ = {vaguely,
neighboring, exactly}. So, linguistically speaking we put: “x is A” $\Leftrightarrow$ {“x is $f_\alpha A$”
with $f_\alpha \in F_{61}$}. In other words: Rejection of “x is A” $\Leftrightarrow$ Rejection of {“x is $f_\alpha A$”
with $f_\alpha \in F_{61}$}. Moreover, the assertion “x is not A” can in addition imply a
reference to a nuanced property P in order to define “x is P” as the intended
meaning of “x is not A”. Intuitively the speaker understands this real difference
in terms of a weak neighborhood between the membership degrees to A and
P for their significant values, that is to say: $\mu_A(x)$ or $\mu_P(x)$ is rather close to
1 $\Rightarrow$ $\mu_A(x)$ and $\mu_P(x)$ are weakly neighboring. It is obvious that the expressions
“rather close to” and “weakly neighboring” can receive different translations,
each one defining a linguistic negation by the strength with which the assertion
is denied. That’s why we have introduced a first parameter p, with $1 \geq p \geq 0$, to
take into account this negation strength.

Let us now talk about the notions of neighborhood of membership degrees 
and of fuzzy set similarity in common sense reasoning. Commonly they are un- 
derstood as graduated relations between entities. The degree of neighborhood 
must be at a maximum for identical entities ; it doesn’t depend on the order, and 
it can decrease by propagation. So, these relations are weakly transitive. We can 
find in ([1], [16], [23], [29], [26]) different approaches to such similarity relations. 
Let us point out a final detail: the strength of linguistic negation depends also 
on the precision of each degree of neighborhood or similarity. In other words, 
a speaker attaches a tolerance threshold to the precision of each degree. That’s
why we use a second parameter, denoted as $\varepsilon$, with $p \geq \varepsilon \geq 0$, to define the
negation strength according to a tolerance threshold.

[ Definition 1.] A neighborhood relation V is defined in [0, 1] as follows:

VN1: $V(x, x) = 1$ (Reflexivity); VN2: $V(x, y) = V(y, x)$ (Symmetry);

VN3: $V(x, z) \geq T(V(x, y), V(y, z))$, where T is a T-norm (Pseudo-transitivity).

[ Definition 2.] Given a neighborhood relation V defined in [0, 1], x and y are said
to be (at least) $\alpha$-neighboring if $V(x, y) \geq \alpha$.

- Example. $V_L(x, y) = Min(x \to_L y, y \to_L x)$, with $u \to_L v = 1$ if $u \leq v$ else $1 - u + v$
(Lukasiewicz’ implication), is a neighborhood relation.
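
The example can be checked numerically. The following is only a sketch under stated assumptions: the text does not name the T-norm T, so the Lukasiewicz T-norm max(0, a + b - 1) is assumed here, and the helper names and the sampling grid are illustrative.

```python
from itertools import product


def luk_implication(u: float, v: float) -> float:
    """Lukasiewicz implication: 1 if u <= v, else 1 - u + v."""
    return 1.0 if u <= v else 1.0 - u + v


def v_l(x: float, y: float) -> float:
    """V_L(x, y) = Min(x ->_L y, y ->_L x), which equals 1 - |x - y|."""
    return min(luk_implication(x, y), luk_implication(y, x))


def luk_tnorm(a: float, b: float) -> float:
    """Lukasiewicz T-norm (an assumption; the paper leaves T unspecified)."""
    return max(0.0, a + b - 1.0)


def is_neighborhood_relation(v, tnorm, grid, tol=1e-9) -> bool:
    """Check VN1 (reflexivity), VN2 (symmetry) and VN3 (pseudo-transitivity) on a grid."""
    for x, y, z in product(grid, repeat=3):
        if abs(v(x, x) - 1.0) > tol:
            return False
        if abs(v(x, y) - v(y, x)) > tol:
            return False
        if v(x, z) + tol < tnorm(v(x, y), v(y, z)):
            return False
    return True


GRID = [i / 10 for i in range(11)]
print(is_neighborhood_relation(v_l, luk_tnorm, GRID))
print(is_neighborhood_relation(v_l, min, GRID))
```

On this grid VN1 to VN3 hold for the assumed Lukasiewicz T-norm, whereas VN3 fails if T is taken to be Min (for instance x = 0, y = 0.5, z = 1), so the choice of T matters.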

Let us now denote by $\mathcal{F}$ the set of all fuzzy sets defined in the domain D.

[ Definition 3.] Let $V: \mathcal{F} \times \mathcal{F} \to [0, 1]$. Then V defines a similarity relation in
$\mathcal{F}$ if and only if: S1: $V(A, A) = 1$ (Reflexivity); S2: $V(A, B) = V(B, A)$ (Symmetry);
S3: $V(A, C) \geq T(V(A, B), V(B, C))$, T being the previous T-norm (Pseudo-transitivity).

[ Definition 4.] Given V, the fuzzy sets A and B of $\mathcal{F}$ are said to be (at least)
$\alpha$ similar, and we denote this as $V(A, B) \geq \alpha$, iff $\forall x$, $V(\mu_A(x), \mu_B(x)) \geq \alpha$.
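
As an illustration of Definition 4, the largest $\alpha$ for which two fuzzy sets are (at least) $\alpha$ similar can be computed on a sampled domain. This is a sketch only: it assumes the neighborhood relation $V_L(x, y) = 1 - |x - y|$ from the example above, and the triangular membership functions and the sampling grid are hypothetical.

```python
def v_l(x: float, y: float) -> float:
    """Neighborhood relation V_L written in its closed form 1 - |x - y|."""
    return 1.0 - abs(x - y)


def triangle(a: float, b: float, c: float):
    """Hypothetical triangular membership function with support (a, c) and peak at b."""
    def mu(x: float) -> float:
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu


def similarity_degree(mu_a, mu_b, domain, v=v_l) -> float:
    """Largest alpha such that V(mu_A(x), mu_B(x)) >= alpha for every sampled x."""
    return min(v(mu_a(x), mu_b(x)) for x in domain)


DOMAIN = [i / 2 for i in range(13)]   # 0.0, 0.5, ..., 6.0
set_a = triangle(0.0, 2.0, 4.0)
set_b = triangle(1.0, 3.0, 5.0)       # the same shape shifted by one unit
print(similarity_degree(set_a, set_b, DOMAIN))
```

Because the degree is a minimum over the domain, a single point where the two membership degrees differ strongly is enough to make the sets weakly similar.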

In order to define the linguistic nuanced similarity of fuzzy sets, we can
introduce a set of linguistic expressions L = {$\theta_1, \ldots, \theta_7$} = {not at all, very little,
rather little, moderately ($\emptyset$), rather, very, totally} totally ordered as follows:
$i > j \Leftrightarrow \theta_i > \theta_j$. Moreover, we suppose that they are defined as fuzzy subintervals
([10]) of [0, 1] (Fig. 3).

[ Definition 5.] Given V, two fuzzy sets A and B of $\mathcal{F}$ are said to be $\theta_i$ similar
if and only if $\mu_{\theta_i}(\alpha) = Max_j\{\mu_{\theta_j}(\alpha)\}$ knowing that $\alpha = Max\{\delta \mid V(A, B) \geq \delta\}$.

Then, we can prove the following properties. 

[ Property 1.] The linguistic similarity relation possesses the following properties:
LS1: A is totally similar to A (Reflexivity)

LS2: if A and B are $\theta_i$ similar then B and A are $\theta_i$ similar (Symmetry)

LS3: if A and B are $\theta_i$ similar and B and C are $\theta_j$ similar then A and C
are $\theta_k$ similar with $\theta_k \geq \theta_l$ knowing that $\theta_l$ is such that $\mu_{\theta_l}(T(\alpha_1, \alpha_2)) =
Max_i\{\mu_{\theta_i}(T(\alpha_1, \alpha_2))\}$, $\alpha_1 = Max\{\delta \mid V(A, B) \geq \delta\}$ and $\alpha_2 = Max\{\delta \mid V(B, C) \geq
\delta\}$ (Weak transitivity).



[ Property 2.] There exists a multiset function $n: \mathcal{N} \to \mathcal{P}(\mathcal{N})$ such that n(A)
is the set of all nuanced properties associated with the concept defining the
nuanced property A.

This being done, we can go into the details of the new concept of linguistic 
negation. 

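[ Definition 6.] Let $\varepsilon$ be such that $0 \leq \varepsilon \leq p \leq 1$. The multiset function
$Neg_{p,\varepsilon}: \mathcal{N} \to \mathcal{P}(\mathcal{N})$ defined as follows: $\forall A \in \mathcal{N}, \forall P \in n(A)$, $P \in Neg_{p,\varepsilon}(A)$ if and
only if:
N1: P and A are $\theta_i$-similar with $\theta_i <$ moderately;
N2: $\mu_A(x) \geq p \Rightarrow V(\mu_A(x), \mu_P(x)) \leq 1 - p + \varepsilon$;
N3: $\mu_P(x) \geq p \Rightarrow V(\mu_P(x), \mu_A(x)) \leq 1 - p + \varepsilon$,
is said to be the linguistic negation p-compatible with a tolerance threshold $\varepsilon$.
Moreover, any $P \in Neg_{p,\varepsilon}(A)$ is said to be a linguistic negation p-compatible
with A with the tolerance threshold $\varepsilon$.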
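
Conditions N2 and N3 can be tested mechanically on a sampled domain. The sketch below is illustrative only: it assumes $V = V_L$ in the closed form $1 - |x - y|$, it does not test condition N1 (which needs the linguistic similarity degrees of Fig. 3), and the membership functions for “low”, “rather low” and “high” are hypothetical.

```python
def v_l(x: float, y: float) -> float:
    """Neighborhood relation V_L(x, y) = 1 - |x - y| (assumed choice of V)."""
    return 1.0 - abs(x - y)


def satisfies_n2_n3(mu_a, mu_p, domain, p, eps, v=v_l, tol=1e-9) -> bool:
    """Check conditions N2 and N3 of Definition 6 on a sampled domain (N1 is not checked)."""
    if not 0.0 <= eps <= p <= 1.0:
        raise ValueError("expected 0 <= eps <= p <= 1")
    bound = 1.0 - p + eps
    for x in domain:
        a, q = mu_a(x), mu_p(x)
        if a >= p and v(a, q) > bound + tol:   # N2
            return False
        if q >= p and v(q, a) > bound + tol:   # N3
            return False
    return True


# Hypothetical nuances of a "wage" concept on the domain [0, 10].
def low(x):
    return max(0.0, min(1.0, (5.0 - x) / 3.0))


def rather_low(x):
    return max(0.0, min(1.0, (6.0 - x) / 3.0))


def high(x):
    return max(0.0, min(1.0, (x - 5.0) / 3.0))


DOMAIN = [i / 10 for i in range(101)]
print(satisfies_n2_n3(low, high, DOMAIN, p=0.97, eps=0.3))
print(satisfies_n2_n3(low, rather_low, DOMAIN, p=0.97, eps=0.3))
```

With these sample functions “high” passes N2 and N3 with respect to “low”, while “rather low” fails because both membership degrees equal 1 on part of the domain.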

We can specify the functions of parameters p and $\varepsilon$ (conditions N2 and N3).
For given $\varepsilon$, we can illustrate the function of p within the application domain
of linguistic negation of A. When p increases, its application domain, the strong
p-cut of A (resp. P) ([10]), decreases by inclusion in the domain D to the strong
1-cut. So, when p > 0, the conditions N2 and N3 define the threshold p of
minimal compatibility between elements of D with A regarding linguistic negation
of A. So, p defines among possible reference frames of the negation of A in the
domain D the strong p-cut of A as the reference frame for linguistic negation
p-compatible with A. For given p, the parameter $\varepsilon$ induces, for any $\mu_A(x) \geq p$
(resp. $\mu_P(x) \geq p$), the local neighborhood condition $V(\mu_A(x), \mu_P(x)) \leq 1 - p + \varepsilon$.
Its translation implies a maximal threshold for the values of $\mu_P(x)$ (resp. $\mu_A(x)$).
Moreover, when $\varepsilon$ decreases, the local neighborhood decreasing to $(1 - p)$, this
threshold also decreases. In other words, $\varepsilon$ defines a maximal tolerance threshold
to which we refer to modulate the precision of linguistic negation p-compatible
with A. As a conclusion, the combination of p and $\varepsilon$ allows the choice of reference
frame from which the possible values of linguistic negation will be extracted.

- Example. Using $V = V_L$, p = 0.97 and $\varepsilon$ = 0.3, we have collected in Figure
4, $Neg_{0.97,0.3}$(“low”), the set of linguistic negations 0.97-compatible with “low”
with tolerance threshold 0.3.





Fig. 3. Linguistic similarity degrees

Fig. 4. Plausible linguistic negations

- Remark: In [18,17,20,19], the linguistic negation was based upon the previous
particular neighborhood relation $V_L$. Moreover, the condition N2 was basically
different: $\mu_A(x) = \xi \geq 0.67 + \varepsilon \Rightarrow \mu_P(x) \leq \xi - 0.67$. It is easy to establish that
this linguistic negation satisfies the new property N2. In other words, our new
approach to linguistic negation improves the previous models.

The set $Neg_{p,\varepsilon}(A)$ defines the reference frame from which we have to extract
the values of the linguistic negation. This comes down to defining explicitly a subset
of $Neg_{p,\varepsilon}(A)$, consisting of the intended meaning of this occurrence of “x is not
A”. More formally, we can denote as (A, x) the pair representing “x is A”. The
previous process leads to the definition of a choice function $neg_{p,\varepsilon}: \mathcal{N} \times X \to \mathcal{P}(\mathcal{N})$,
which associates with (A, x) the set $neg_{p,\varepsilon}(A, x)$ of nuances accepted as intended
meanings of “x is not A”.

[ Definition 7.] Any $P \in neg_{p,\varepsilon}(A, x)$ is called an intended meaning in I at x
of the linguistic negation p-compatible with A with tolerance threshold $\varepsilon$. We say
also that “x is P” is an intended meaning in I with the tolerance threshold $\varepsilon$ of
the linguistic negation p-compatible of “x is A”. If no confusion is possible, we
simply say that P is an intended meaning of the linguistic negation of A at x.

We can now present without proofs some properties of this linguistic negation. 

[ Property 3.] For many properties A, “x is A” does not automatically define the
knowledge about “x is not A”. More precisely, for any A we only know that “x
is P” can be an intended meaning of “x is not A” if P belongs to $neg_{p,\varepsilon}(A, x)$.

[ Property 4.] If $P \in Neg_{p,\varepsilon}(A)$ then $A \in Neg_{p,\varepsilon}(P)$. But, if $P \in neg_{p,\varepsilon}(A, x)$ we
cannot assert that $A \in neg_{p,\varepsilon}(P, x)$. In other words, the double negation of A
belongs to a frame reference of the linguistic negation, but the intended meaning
of “x is not P” does not generally lead to “x is A”.



[ Property 5.] $Neg_{1,0}(A)$ = {$P \in n(A)$, P is less than moderately similar to
A, $\mu_A(x)$ (resp. $\mu_P(x)$) $= 1 \Rightarrow V(1, \mu_P(x))$ (resp. $V(1, \mu_A(x))$) $= 0$}.

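[ Property 6.] $0 \leq \varepsilon \leq p \leq 1$, $Neg_{1,\varepsilon}(A) \subseteq Neg_{p,\varepsilon}(A) \subseteq Neg_{0,0}(A)$.

[ Property 7.] $0 \leq \varepsilon \leq p \leq 1$, if $Neg_{1,\varepsilon}(A) \neq \emptyset$ then $Neg_{p,\varepsilon}(A) \neq \emptyset$.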

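[ Property 8.] $\{p \leq p', \varepsilon' \leq \varepsilon \leq p\} \Rightarrow Neg_{1,0}(A) \subseteq Neg_{p',\varepsilon'}(A) \subseteq Neg_{p,\varepsilon}(A) \subseteq
Neg_{0,0}(A)$.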

[ Property 9.] If we choose the neighboring relation $V_L(x, y) = Min\{x \to_L y,
y \to_L x\}$, then $Neg_{1,0}(A) \subseteq Neg_{p,\varepsilon}(A)$ is generally not empty.

Let us recall conditions satisfied by modifiers ([8,9]). 

H1: A and $f_\alpha A$ are less than moderately similar if and only if $f_\alpha \in F_{62}$.
H2: The translation modifiers are defined in such a way that the resulting nu- 
ances cover the corresponding domain. 

We suppose that the properties fulfil also the following conditions : 

H3: For any basic property Pik^ there exists one translation modifier defining 
a nuance which differs from Pik for the significant values : 0 < 

pk < Pk,^ma e Mr,{'^^^PPik{V >Pk^ Pa{x) < fik with A = niaPik}. 

H4: For any concept C^, the associated basic properties differ basically for sig- 
nificant values: VA;,3o-fe,0 < < l,3lkX < 7fe < o-k,{^x, > o^k ^ Vj p 

k,pp,.{x) < jk}- 

Then, it is easy to prove the following result. 

[ Property 10.] There exist neighborhood and similarity relations leading to a
concept of linguistic negation taking into account all linguistic interpretations of
“x is not A”.

4 Deductive Process dealing with Linguistic Negation 

The presence of linguistic negations in the knowledge base does not generally
modify the use of the existing deductive process. For example, we can suppose
that for given p and $\varepsilon$, the intended meanings of negations in the initial knowledge
base lead to R’1: if Jack is very tall then Jack is visible in the crowd, R’2: if
the wage is low then the summer holidays are $\in$ {short, average}, R’3: if the
weather is dry then the tourist season is not-bad. The precision modifiers give
us: Jack is really very tall $\Rightarrow$ Jack is very tall, and the wage is really low $\Rightarrow$
the wage is low. By using facts F1-F3, we can deduce that: Jack is visible in
the crowd, the summer holidays are short or average, and the tourist season is
not-bad.

In classical systems, a deductive process based upon logical negation
cannot be applied to facts and rules which include linguistic negations. But this is
not the case with our approach to linguistic negation, since the same deductive
process can be applied to equivalent rules referring only to affirmative information.
Moreover, a lack of information about the exact intended meaning does not
necessarily prohibit the use of rules containing a linguistic negation in their premise.
So, our approach to linguistic negation improves the abilities in the management
of a knowledge base in that facts or rules can include linguistic negations.

5 Conclusion 

We have defined a general concept of linguistic negation of nuanced property
based upon neighborhood and similarity relations and depending on compatibility
level and tolerance threshold. A combination of these values allows the choice of
reference frame from which the possible values of linguistic negation of “x is
A” are extracted. This new approach to linguistic negation can be presented
as a generalization of previous models taking into account linguistic analysis of
negation. It appears clearly that this approach to linguistic negation improves
the abilities in the management of a knowledge base in that facts or rules can
include linguistic negations.

Abstract. Rough set theory depending upon deterministic information
systems or knowledge bases is now becoming a mathematical foundation
of soft computing. In this paper, we take up nondeterministic knowledge
bases with incomplete and selective information. Both kinds of information
are given as a set of attribute values; their difference comes from a
temporal concept. If the information refers to the past, then we see it
as incomplete information. On the other hand, selective information
means that the real attribute value is not yet decided within a set, i.e., we
can select the most proper value from this set. By introducing these two
kinds of information into knowledge bases, we develop another framework
for nondeterministic knowledge bases. Namely, we discuss question-answering,
approximation, the rough set concept and dependencies of attributes on
these nondeterministic knowledge bases.



1 Introduction 

Rough set theory is seen as a mathematical foundation of soft computing, which
covers some areas of research in AI, i.e., knowledge, imprecision, vagueness,
learning, induction [1,2]. We have recently seen some applications of this theory to
knowledge discovery and data mining [3,4,5]. Rough set theory historically seems
to depend upon information systems, including information retrieval
systems [6,7,8]. As for information systems, the main purpose seems to be to establish
a theory for question-answering. We know some famous works. However, rough set
theory handles not only question-answering but also several other issues for soft
computing, like approximation, dependencies of attributes and decision rules.

In this paper we basically depend upon the definitions in [1,2]. Furthermore,
we deal with the case in which every attribute value for every object is unique and
not multivalued. Dealing with multivalued attributes needs more research.
According to [1], we use the term ‘knowledge bases’ instead of ‘information systems’.
For these knowledge bases, we introduce the following incomplete information
and selective information.

(1) Incomplete information: For object x, there is a real attribute value in a set,
but we cannot decide which is the real one for lack of information.

(2) Selective information: For object x, we can select its attribute value from
a set. In this case, the selection is done by every user of these knowledge
bases. The issue in handling selective information is to find the most proper
attribute value for every user’s purpose.

Incomplete information gives rise to the modal concepts ‘possibility’ and ‘necessity’,
and there exists an unknown real attribute value. In selective information, however,
there is no modal concept and no real attribute value. This is the big difference
between the two kinds of information. If we see affiliation(tom, {management, development})
as incomplete information then we see that Tom’s affiliation is either management
or development. Therefore, we see that affiliation(tom, management) may be
true. On the other hand, if we see it as selective information then we see that we
can select his affiliation from management or development. The user of this
information will decide Tom’s affiliation by checking the expectation value in
every case. Therefore, the introduction of selective information helps us to
discuss issues like planning and decision making, too. Here, the purpose is
to find the most proper attribute value for every user.

We call knowledge bases with the two above kinds of information ‘nondeterministic
knowledge bases’ from now on. We mainly discuss question-answering and
variational rough set theory in nondeterministic knowledge bases.

2 Nondeterministic Knowledge Bases : NKBs 

Now in this section, we give some basic definitions according to [1]. Let KB =
(X, A, V, f) be a knowledge base, where X is a finite set of objects, A is a finite
set of attributes, V is the set-theoretical union of domains of attributes from A
and f is a classification function such that $f : X \times A \to V$. In every KB, the
value of the classification function is definite.

Then, we give the definition of a nondeterministic knowledge base (NKB) by
incomplete and selective information. We need to specify three kinds of information
for every NKB, namely definite, incomplete or selective information. We
name a set S = {d (definite), i (incomplete), s (selective)} the state set. In this case,
we define an NKB = (X, A, V, S, g) where g is a classification function which
satisfies $g : X \times A \to 2^V \times S$. We usually use the attribute value instead of a
singleton set in case of definite information.

Example 1. Let’s consider a party planning issue. We have to decide the place and
date for a party. Here X = {plan1, plan2, plan3}, A = {place, date, estimated_cost},
$V_{place}$ = {place1, place2}, $V_{date}$ = {fri, sat, sun} and $V_{estimated\_cost}$ =
Set_of_natural_number. As for the details, we know the following table by classification
function g. This $NKB_1$ = (X, A, V, S, g) shows that we have three plans with
incomplete and selective information. For example, plan1 shows we can reserve
place1 on Saturday or Sunday and the estimated cost is between 12 and 15, i.e., the
estimated cost is either 12, 13, 14 or 15. When we see this $NKB_1$ as a knowledge
base, we may think of the following queries according to $NKB_1$.

(1) How much is necessary at least for selecting place1?

(2) How much is necessary at least for selecting Sunday?

X       place    date                   estimated_cost
plan1   place1   ({sat, sun}, s)        ([12,15], i)
plan2   place1   fri                    ([7,9], i)
plan3   place2   ({fri, sat, sun}, s)   ([5,7], i)

Table 1. Knowledge base $NKB_1$



These queries show the aspect of question-answering in $NKB_1$. However, we have
another aspect in $NKB_1$, namely the aspect of planning. Every user has the most
proper selection according to his criteria from this $NKB_1$. For example, if the
attendance is usually maximum on Sunday and a user wants to gather a large
attendance, then he will select Sunday from Table 1.

3 Question-Answering in NKBs

Now in this section, we refer to question-answering in every NKB. In incomplete
information systems, two valuational concepts ‘may hold’ and ‘surely hold’
come from the incomplete information [7]. However, NKBs have not only two
valuational concepts but also the concept of optimal selection.

3.1 Extensions in Every NKB

We first give definitions in every NKB. Let NKB = (X, A, V, S, g) be a
nondeterministic knowledge base, where $g : X \times A \to 2^V \times S$. In this case, we call a
classification function g' which satisfies the following (1) and (2) an extensional
classification function with selective information.

(1) For every object $x (\in X)$ and attribute $a (\in A)$ which satisfy $g(x, a) =
(subset\_SV\_of\_V, i)$, $g'(x, a) = (element\_of\_SV, d)$.

(2) For every object $x (\in X)$ and attribute $a (\in A)$ which satisfy $g(x, a) =
(element\_of\_V, d)$ or $g(x, a) = (subset\_SV\_of\_V, s)$, $g'(x, a) = g(x, a)$.

In the above (1), every incomplete information is removed. In this case, we
call an NKB' = (X, A, V, {d, s}, g') an extension of NKB with selective
information. Of course, the extension of every NKB may not be unique. So, we
define $EXT_s(NKB)$ = {NKB' | NKB' is an extension of NKB with selective
information}. For NKB' = (X, A, V, {d, s}, g') $(\in EXT_s(NKB))$, we call a
classification function g* which satisfies the following (1) and (2) an extensional
classification function.

(1) For every object $x (\in X)$ and attribute $a (\in A)$ which satisfy $g'(x, a) =
(subset\_SV\_of\_V, s)$, $g^*(x, a) = (element\_of\_SV, d)$.

(2) For every object $x (\in X)$ and attribute $a (\in A)$ which satisfy $g'(x, a) =
(element\_of\_V, d)$, $g^*(x, a) = g'(x, a)$.

Here, we call an NKB* = (X, A, V, {d}, g*) an extension of NKB via NKB'.
We also define EXT(NKB, NKB') = {NKB* | NKB* is an extension of NKB
via NKB'} and $EXT(NKB) = \bigcup_{NKB' \in EXT_s(NKB)} EXT(NKB, NKB')$.
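
The two-stage extension can be sketched by plain enumeration. The encoding below is an assumption made for illustration: each cell of Table 1 is stored as a pair (candidate values, state), the interval [12,15] is expanded to its natural numbers, and the function names `resolve`, `ext_s`, `ext_via` and `ext` are hypothetical.

```python
from itertools import product

# Table 1 encoded as object -> attribute -> (candidate values, state),
# with state "d" (definite), "i" (incomplete) or "s" (selective).
NKB1 = {
    "plan1": {"place": (["place1"], "d"), "date": (["sat", "sun"], "s"),
              "estimated_cost": ([12, 13, 14, 15], "i")},
    "plan2": {"place": (["place1"], "d"), "date": (["fri"], "d"),
              "estimated_cost": ([7, 8, 9], "i")},
    "plan3": {"place": (["place2"], "d"), "date": (["fri", "sat", "sun"], "s"),
              "estimated_cost": ([5, 6, 7], "i")},
}


def resolve(nkb, state):
    """Yield every knowledge base in which each cell of the given state is made definite."""
    cells = [(x, a) for x, row in nkb.items() for a, (_, s) in row.items() if s == state]
    for choice in product(*(nkb[x][a][0] for x, a in cells)):
        new = {x: dict(row) for x, row in nkb.items()}
        for (x, a), value in zip(cells, choice):
            new[x][a] = ([value], "d")
        yield new


def ext_s(nkb):
    """EXT_s(NKB): remove incomplete information, keep selective information."""
    return resolve(nkb, "i")


def ext_via(nkb_prime):
    """EXT(NKB, NKB'): fix every selective cell of NKB'."""
    return resolve(nkb_prime, "s")


def ext(nkb):
    """EXT(NKB): union of EXT(NKB, NKB') over all NKB' in EXT_s(NKB)."""
    return (kb for nkb_prime in ext_s(nkb) for kb in ext_via(nkb_prime))


print(len(list(ext_s(NKB1))), sum(1 for _ in ext(NKB1)))
```

For Table 1 this gives 4 x 3 x 3 extensions with selective information, each of which has 2 x 3 definite extensions, so the enumeration grows multiplicatively with the number of non-definite cells.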
3.2 Question-Answering with Some Conditions in NKBs
According to $EXT_s(NKB)$ and EXT(NKB, NKB'), we give the following
definition. Let q be a query, where q is a conjunction of atoms, because we depend
upon Prolog in the implementation. We have not yet discussed the equivalence
translation for queries as in [8].

(1) For a query q, the query q holds in $NKB' (\in EXT_s(NKB))$ under selective
values, if there is an $NKB^* (\in EXT(NKB, NKB'))$ such that NKB*
satisfies q.

(2) For a query q, the query q may hold in NKB, if there is an $NKB'
(\in EXT_s(NKB))$ such that q holds in NKB' under selective values.

(3) For a query q, the query q surely holds in NKB, if q holds in every $NKB'
(\in EXT_s(NKB))$.

In this definition, there are no uncertainties within an $NKB' (\in EXT_s(NKB))$, so
(1) involves no concept of may and sure. In (1), our purpose is to select the
$NKB^* (\in EXT(NKB, NKB'))$ which satisfies query q. We had better use the term
‘constraints’ instead of the term ‘query’. In (2) and (3), we are handling incomplete
information, so there is the concept of may and sure.
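
The three notions (1) to (3) can be illustrated on Table 1. This is a sketch under assumptions: a query is modelled as a Python predicate over a definite knowledge base instead of a Prolog conjunction of atoms, the cell encoding (candidate values, state) and all function names are hypothetical, and the two sample queries are invented for the illustration.

```python
from itertools import product

# Table 1 encoded as object -> attribute -> (candidate values, state).
NKB1 = {
    "plan1": {"place": (["place1"], "d"), "date": (["sat", "sun"], "s"),
              "estimated_cost": ([12, 13, 14, 15], "i")},
    "plan2": {"place": (["place1"], "d"), "date": (["fri"], "d"),
              "estimated_cost": ([7, 8, 9], "i")},
    "plan3": {"place": (["place2"], "d"), "date": (["fri", "sat", "sun"], "s"),
              "estimated_cost": ([5, 6, 7], "i")},
}


def resolve(nkb, state):
    """Yield every knowledge base in which each cell of the given state is made definite."""
    cells = [(x, a) for x, row in nkb.items() for a, (_, s) in row.items() if s == state]
    for choice in product(*(nkb[x][a][0] for x, a in cells)):
        new = {x: dict(row) for x, row in nkb.items()}
        for (x, a), value in zip(cells, choice):
            new[x][a] = ([value], "d")
        yield new


def value(kb, x, a):
    """Attribute value of object x in a fully definite knowledge base."""
    return kb[x][a][0][0]


def holds_under_selection(nkb_prime, q):
    """(1): q holds in NKB' if some NKB* in EXT(NKB, NKB') satisfies q."""
    return any(q(kb) for kb in resolve(nkb_prime, "s"))


def may_hold(nkb, q):
    """(2): q holds under selective values in some NKB' of EXT_s(NKB)."""
    return any(holds_under_selection(p, q) for p in resolve(nkb, "i"))


def surely_holds(nkb, q):
    """(3): q holds under selective values in every NKB' of EXT_s(NKB)."""
    return all(holds_under_selection(p, q) for p in resolve(nkb, "i"))


def sunday_at_place1_within_12(kb):
    return any(value(kb, x, "place") == "place1" and value(kb, x, "date") == "sun"
               and value(kb, x, "estimated_cost") <= 12 for x in kb)


def friday_at_place1_within_9(kb):
    return any(value(kb, x, "place") == "place1" and value(kb, x, "date") == "fri"
               and value(kb, x, "estimated_cost") <= 9 for x in kb)


print(may_hold(NKB1, sunday_at_place1_within_12),
      surely_holds(NKB1, sunday_at_place1_within_12),
      surely_holds(NKB1, friday_at_place1_within_9))
```

The first query only may hold, because it depends on the unknown cost of plan1 being 12; the second surely holds, because every possible cost of plan2 is at most 9. In both cases the date is a matter of selection, not of uncertainty.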

3.3 Implemented Question-Answering System in NKBs

Now we briefly show the implemented question-answering system for NKBs. We
have also been discussing logic programs handling incomplete information [9]. We
revised this prover to handle not only incomplete information but also selective
information, so that it can solve queries in the above (1), (2) and (3). The
following is the preliminary Prolog program for the query interpreter.

% dmprob(Goal, CHYPO, AHYPO): prove Goal; CHYPO appears to be the set of
% hypotheses already adopted and AHYPO the set after the proof.
dmprob(X,CHYPO,AHYPO) :- functor(X,Y,_), Y \== ',', !, dmatom(X,CHYPO,AHYPO).
dmprob(X,CHYPO,AHYPO) :- functor(X,',',2), arg(1,X,X1), dmatom(X1,CHYPO,AHYPO1),
    arg(2,X,X2), dmprob(X2,AHYPO1,AHYPO).
% An atom is proved from a fact or from a rule of the program ...
dmatom(X,CHYPO,AHYPO) :- clause(X,true), AHYPO=CHYPO.
dmatom(X,CHYPO,AHYPO) :- clause(X,Y), Y \== true, dmprob(Y,CHYPO,AHYPO).
% ... or from a hypothesis; hypo/3, iccheck/2 and add/3 are not listed in the paper.
dmatom(X,CHYPO,AHYPO) :- hypo(M1,M2,X), iccheck(hp(M1,M2),CHYPO), add(hp(M1,M2),CHYPO,AHYPO).
dmatom(X,CHYPO,AHYPO) :- hypo(M1,M2,(X:-GOAL)), GOAL \== true, iccheck(hp(M1,M2),CHYPO),
    add(hp(M1,M2),CHYPO,CHYPO1), dmprob(GOAL,CHYPO1,AHYPO).

We omit the details of the implementation to reduce the length of this paper. The
details of the previous version of this prover are in [9].

4 Approximations and Rough Sets in NKBs

Now in this section, we give definitions for the approximation and roughness
in every NKB = (X, A, V, S, g). Let’s suppose $NKB^* (\in EXT(NKB))$ and
$R (\subseteq A)$. The NKB* is a typical knowledge base, so we know that $X' (\subseteq X)$ is
R-definable or R-rough in NKB* according to [1]. We use the R-equivalence relation
on X for deciding it. First we give a definition in $NKB' (\in EXT_s(NKB))$, where
there is no concept of incompleteness.

(1) For subsets $X' (\subseteq X)$ and $R (\subseteq A)$, X' is R-definable in NKB', if X' is
R-definable in an $NKB^* (\in EXT(NKB, NKB'))$.

(2) For subsets $X' (\subseteq X)$ and $R (\subseteq A)$, X' is R-rough in NKB', if X' is not
R-definable in any $NKB^* (\in EXT(NKB, NKB'))$.

The above definition seems to be natural by the definition of selective
information. Then we go to the definition in NKB, which has the incompleteness. We
need the concepts ‘may’ and ‘sure’.

(1) For subsets $X' (\subseteq X)$ and $R (\subseteq A)$, X' may be R-definable (or R-rough) in
NKB, if X' is R-definable (or R-rough) in an $NKB' (\in EXT_s(NKB))$.

(2) For subsets $X' (\subseteq X)$ and $R (\subseteq A)$, X' is surely R-definable (or R-rough) in
NKB, if X' is R-definable (or R-rough) in every $NKB' (\in EXT_s(NKB))$.

Example 2. Let’s consider the following $NKB_2$ = (X, A, V, S, g).

X   a1            a2
1   ({1,2}, i)    2
2   1             ({1,2}, s)
3   2             2

Table 2. Knowledge base $NKB_2$

In this $NKB_2$, a set {1,3} $(\subseteq X)$ is surely A-definable, because the following holds.

(Case 1) If we deal with $g'(1, a_1) = (1, d)$ then we select $g^*(2, a_2) = (1, d)$. In
this case, $(a_1 = 1 \wedge a_2 = 2) \vee (a_1 = 2 \wedge a_2 = 2)$ expresses the set {1,3}.
(Case 2) If we deal with $g'(1, a_1) = (2, d)$ then we select $g^*(2, a_2) = (1, d)$. In
this case, $a_1 = 2 \wedge a_2 = 2$ expresses the set {1,3}.

As for the set {1}, it may be A-definable and also may be A-rough, because in
(Case 1) $a_1 = 1 \wedge a_2 = 2$ expresses {1}, but in (Case 2) we cannot discriminate
1 and 3.
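
Example 2 can be replayed by brute force. The sketch below rests on assumptions: the two non-definite cells of Table 2 are taken to be ({1,2}, i) for object 1 and ({1,2}, s) for object 2, as the case analysis suggests, and the cell encoding and function names are hypothetical.

```python
from itertools import product

# Assumed reading of Table 2: object -> attribute -> (candidate values, state).
NKB2 = {
    1: {"a1": ([1, 2], "i"), "a2": ([2], "d")},
    2: {"a1": ([1], "d"), "a2": ([1, 2], "s")},
    3: {"a1": ([2], "d"), "a2": ([2], "d")},
}
ATTRS = ("a1", "a2")


def resolve(nkb, state):
    """Yield every knowledge base in which each cell of the given state is made definite."""
    cells = [(x, a) for x, row in nkb.items() for a, (_, s) in row.items() if s == state]
    for choice in product(*(nkb[x][a][0] for x, a in cells)):
        new = {x: dict(row) for x, row in nkb.items()}
        for (x, a), value in zip(cells, choice):
            new[x][a] = ([value], "d")
        yield new


def classes(kb, attrs):
    """Equivalence classes of a definite knowledge base with respect to attrs."""
    groups = {}
    for x, row in kb.items():
        groups.setdefault(tuple(row[a][0][0] for a in attrs), set()).add(x)
    return list(groups.values())


def definable(kb, attrs, subset):
    """subset is a union of equivalence classes of the definite knowledge base kb."""
    return all(c <= subset or c.isdisjoint(subset) for c in classes(kb, attrs))


def definable_in(nkb_prime, attrs, subset):
    """R-definable in NKB': definable in some NKB* of EXT(NKB, NKB')."""
    return any(definable(kb, attrs, subset) for kb in resolve(nkb_prime, "s"))


def may_be_definable(nkb, attrs, subset):
    return any(definable_in(p, attrs, subset) for p in resolve(nkb, "i"))


def surely_definable(nkb, attrs, subset):
    return all(definable_in(p, attrs, subset) for p in resolve(nkb, "i"))


def may_be_rough(nkb, attrs, subset):
    return any(not definable_in(p, attrs, subset) for p in resolve(nkb, "i"))


print(surely_definable(NKB2, ATTRS, {1, 3}),
      may_be_definable(NKB2, ATTRS, {1}),
      may_be_rough(NKB2, ATTRS, {1}))
```

Under the assumed table the enumeration agrees with the case analysis: {1,3} is surely A-definable, while {1} both may be A-definable and may be A-rough.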

If $X' (\subseteq X)$ is surely R-definable, then X' is not influenced by the
incomplete information. However, if X' is not surely R-definable, then X' is
influenced by it. Suppose X' is R-rough in $NKB' (\in EXT_s(NKB))$. In
this case, the inclusion relation $inf_{NKB^*}(X') \subseteq X' \subseteq sup_{NKB^*}(X')$ depends
upon $NKB^* (\in EXT(NKB, NKB'))$. Especially if the best inclusion relation
uniquely exists in every EXT(NKB, NKB'), then we get the following
proposition.

$\bigcap_{EXT_s(NKB)} (inf_{NKB^*}(X')) \subseteq X' \subseteq \bigcup_{EXT_s(NKB)} (sup_{NKB^*}(X'))$.

This inclusion relation is not influenced by the incomplete information.

5 An Effective Procedure for Checking Rough Sets in NKBs

In the previous section, we defined the R-rough set concept in an NKB, whose
definition depends upon the extensions in every NKB. Now in this section, we
show an effective procedure to check whether X' is R-definable or R-rough in
$NKB' (\in EXT_s(NKB))$. We can easily revise this procedure to check whether
X' may be R-definable (or R-rough) or is surely R-definable (or R-rough) in every
NKB.

Suppose $R (\subseteq A)$, and we consider NKB' = (X, R, $V_R$, {d, s}, g'). For every
object $x_i (\in X)$, if $g'(x_i, a) = (\_, d)$ for every $a (\in R)$ then we call $x_i$ an object
with fixed value. Otherwise, we call it an object with selection. We also define
$X_{fixed} = \{x_i \in X \mid x_i$ is an object with fixed value$\}$. We continue the definition.
For every object $x_i$ with selection we can make an object with fixed value $x_{i,\theta}$
by a selection $\theta$, which we call an object with selection $\theta$ from $x_i$. Finally we give
the following definitions:

(1) $inf(x_{i,\theta}) = \{x_i\} \cup \{x_j \in X_{fixed} \mid$ every attribute value in $x_j$ and $x_{i,\theta}$ is the
same$\}$.

(2) $sup(x_{i,\theta}) = \{x_j \in X \mid$ there exists an object $x_{j,\theta'}$ with selection $\theta'$ whose
attribute values are the same as $x_{i,\theta}\}$.

Then, we get the following proposition. Here $[x_i]$ implies an R-equivalence
relation including object $x_i$.

Proposition 1.

(1) The $inf(x_{i,\theta})$ is the minimal R-equivalence relation including object $x_{i,\theta}$,
which is not influenced by the selective information.

(2) The $sup(x_{i,\theta}) - inf(x_{i,\theta})$ is a set of objects, where every element $x_j$ satisfies
$x_{j,\theta'} \in [x_{i,\theta}]$ for some $\theta'$ and $x_{j,\theta''} \notin [x_{i,\theta}]$ for some $\theta''$.

(3) X' which satisfies $inf(x_{i,\theta}) \subseteq X' \subseteq sup(x_{i,\theta})$ for some $x_i$ and selection $\theta$ is
R-definable in NKB'.

We have to remark that $inf(x_{i,\theta})$ and $sup(x_{i,\theta})$ are not independent in every
$x_i$. The $inf(x_{i,\theta})$ and $sup(x_{i,\theta})$ are mutually related to other $inf(x_{j,\theta'})$ and
$sup(x_{j,\theta'})$. Now we show the overview of the procedure for checking R-rough sets
by Example 3.

Example 3. Let’s consider the following $NKB_3$ = (X, A, V, {d, s}, g').

X   a1            a2
1   ({1,2}, s)    2
2   1             2
3   1             ({1,2}, s)
4   2             2

Table 3. Knowledge base $NKB_3$

According to Table 3, we first prepare the following inf and sup relations for every
object.

(a) {1,2} $\subseteq [1_{a_1=1}] \subseteq$ {1,2,3}, (b) {1,4} $\subseteq [1_{a_1=2}] \subseteq$ {1,4},

(c) {2} $\subseteq [2] \subseteq$ {1,2,3}, (d) {3} $\subseteq [3_{a_2=1}] \subseteq$ {3},

(e) {2,3} $\subseteq [3_{a_2=2}] \subseteq$ {1,2,3}, (f) {4} $\subseteq [4] \subseteq$ {1,4}.
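
The relations (a) to (f) follow mechanically from the definitions of inf and sup. The sketch below assumes the reading of Table 3 in which object 1 has the selective value {1,2} for $a_1$ and object 3 has the selective value {1,2} for $a_2$; since $NKB_3$ contains only definite and selective cells, each cell is encoded simply as its list of selectable values, and the function names are hypothetical.

```python
from itertools import product

# Assumed reading of Table 3: object -> attribute -> selectable values
# (a single value means the cell is definite).
NKB3 = {
    1: {"a1": [1, 2], "a2": [2]},
    2: {"a1": [1], "a2": [2]},
    3: {"a1": [1], "a2": [1, 2]},
    4: {"a1": [2], "a2": [2]},
}
ATTRS = ("a1", "a2")


def selections(row):
    """All value tuples an object can take under some selection."""
    return list(product(*(row[a] for a in ATTRS)))


def inf_sup(kb, x, values):
    """inf and sup of object x once a selection has fixed its value tuple to `values`."""
    fixed = {y for y, row in kb.items() if len(selections(row)) == 1}
    inf = {x} | {y for y in fixed if selections(kb[y])[0] == values}
    sup = {y for y, row in kb.items() if values in selections(row)}
    return inf, sup


for obj in NKB3:
    for values in selections(NKB3[obj]):
        print(obj, values, inf_sup(NKB3, obj, values))
```

Each printed pair corresponds to one of the relations (a) to (f): for instance, object 1 with $a_1 = 1$ yields inf = {1,2} and sup = {1,2,3}.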

We show how we check the R-definability by using a set {1,2,4}. We use (a) and
conclude {1,2} is R-definable. In this case, we implicitly rejected (b) and selected
$3 \notin [1]$ and $3 \notin [2]$. Namely we cannot use (e), because (e) implies [3] = [2]. So
we need to solve whether {4} is R-definable or not by (c), (d) and (f). If we use [4] = {4}
in (f) then $1 \notin [4]$. It contradicts (b). Therefore we try another search path. We
use (b) or (f) and we conclude {1,4} is R-definable; here we rejected (a). So we
need to solve whether {2} is R-definable or not by (c), (d) and (e). If we use [2] = {2} in
(c) then $3 \notin [2]$. Finally we check {3} by (d) and (e). We can select [3] = {3} by
(d), which makes no contradiction. Therefore {1,2,4} is R-definable in $NKB_3$.
This is a nondeterministic procedure with backtracking, which seems to be a
kind of resolution procedure. After checking the R-definability, we also get
the selected values from selective information. The following is the overview of
this procedure.

[Overview for checking R-definability in NKB']

Suppose we are given $X' (\subseteq X)$, $inf(x_{i,\theta})$ and $sup(x_{i,\theta})$ for every $x_{i,\theta} (\in X)$.

(1) Set X* = X'.

(2) Pick up the first element $x_j (\in X^*)$ and find $X'' (\subseteq X^*)$ such that $inf(x_{j,\theta'}) \subseteq
X'' \subseteq sup(x_{j,\theta'})$ for some $\theta'$. If we cannot find such X'' then X' is R-rough.

(3) The usable $inf(x_{i,\theta})$ and $sup(x_{i,\theta})$ are restricted by selecting X'' in (2). So,
check the usable inf and sup, and go to (4).

(4) If there is no contradiction in (3), then set X* = X* − X'' and go to (2).
Especially if X* = $\emptyset$ then X' is R-definable. If there is a contradiction in (3),
then backtrack to (2) and try another X''.
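
A compact way to experiment with this check is a direct backtracking search over the selections. The sketch below is not the inf/sup procedure of the overview: it swaps the bookkeeping of usable inf and sup relations for a simpler pruning test (two objects with equal value tuples must lie on the same side of X'). The encoding of Table 3 and the function name `find_selection` are assumptions.

```python
from itertools import product

# Assumed reading of Table 3: object -> attribute -> selectable values.
NKB3 = {
    1: {"a1": [1, 2], "a2": [2]},
    2: {"a1": [1], "a2": [2]},
    3: {"a1": [1], "a2": [1, 2]},
    4: {"a1": [2], "a2": [2]},
}
ATTRS = ("a1", "a2")


def find_selection(kb, attrs, subset):
    """Return one selection (object -> value tuple) making subset R-definable, or None."""
    objects = list(kb)
    chosen = {}

    def consistent(x, values):
        # x may share a value tuple only with objects on the same side of subset.
        return all((y in subset) == (x in subset)
                   for y, other in chosen.items() if other == values)

    def extend(k):
        if k == len(objects):
            return dict(chosen)
        x = objects[k]
        for values in product(*(kb[x][a] for a in attrs)):
            if consistent(x, values):
                chosen[x] = values
                result = extend(k + 1)
                if result is not None:
                    return result
                del chosen[x]          # backtrack and try another selection
        return None

    return extend(0)


print(find_selection(NKB3, ATTRS, {1, 2, 4}))
print(find_selection(NKB3, ATTRS, {1}))
```

A returned dictionary plays the role of the selected values obtained after checking R-definability; `None` means that no selection works, i.e. the set is R-rough in NKB'.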

We are now formally refining the above procedure. The roughness of a subset
X' in an NKB is decided as follows. If X' is R-definable (R-rough) in $\forall NKB'
(\in EXT_s(NKB))$, then X' is surely R-definable (R-rough) in NKB. If X' is
R-definable (R-rough) in $\exists NKB' (\in EXT_s(NKB))$, then X' may be R-definable
(R-rough) in NKB.

We can also apply the above procedure to find all R-equivalence relations in
every $NKB' (\in EXT_s(NKB))$. The following is an overview.

[Overview for finding all R-equivalence relations in NKB']

Suppose we are given $inf(x_{i,\theta})$ and $sup(x_{i,\theta})$ for every $x_{i,\theta} (\in X)$.

(1) Set X* = X.

(2) Pick up the first element $x_j (\in X^*)$ and find $X'' (\subseteq X^*)$ such that $inf(x_{j,\theta'}) \subseteq
X'' \subseteq sup(x_{j,\theta'})$ for some $\theta'$.

(3) The usable $inf(x_{i,\theta})$ and $sup(x_{i,\theta})$ are restricted by selecting X'' in (2). So,
check the usable inf and sup, and go to (4).

(4) If there is no contradiction in (3), then set $[x_j]$ = X'', X* = X* − X'' and
go to (2). Especially if X* = $\emptyset$ then we got an R-equivalence relation. To
find other relations, backtrack to (2). If there is a contradiction in (3), then
backtrack to (2) and try another X''.

We can also revise this procedure for every NKB with incomplete information.

6 Concluding Remarks

In this paper, we discussed nondeterministic knowledge bases NKBs with
incomplete and selective information. The introduction of these two kinds of information
made knowledge bases more powerful and raised several new issues on knowledge
bases. Our framework will be an advancement from rough set theory on
deterministic knowledge bases. However, we have just prepared a tool, i.e., an analytic
procedure by $inf(x_i)$ and $sup(x_i)$ for every object $x_i (\in X)$. From now on we
will discuss several issues on NKBs, for example,

(1) Handling of multivalued attributes in NKBs,

(2) Significance of attributes in NKBs,

(3) Simplification of decision tables in NKBs,

(4) Reduction of decision rules in NKBs,

(5) Real applications in NKBs,

(6) Learning in NKBs.

The OI-Resolution of Operator Rough Logic

[ Abstract.] Based on rough set theory, this paper establishes the operator
space $[\xi_*, \xi^*]$. It is also a subset of the truth value interval [0,1]. The
operators are put in front of the formulas to produce the many-valued
logic called operator rough logic (ORL). It defines OI-valid and
OI-inconsistent, and the OI-resolution of the logic, where OI is an abbreviation
of Operator Interval. It also proves the soundness theorem of the
logic resolution.

Keywords: Rough Set Theory, Operator Rough Logic, OI-Resolution. 



1 The Notions in Operator Rough Logic 

Let U be a non-empty set, and R be an equivalence relation on U. If $X \subseteq U$ is a
subset of U, then the lower approximation is $R_*(X) = \{ x \in U : R(x) \subseteq X \}$ and the
upper approximation is $R^*(X) = \{ x \in U : R(x) \cap X \neq \emptyset \}$, where $\emptyset$ is the empty
set and R(x) is the equivalence class containing x. The qualities of the lower and
upper approximation are denoted by

$\xi_* = K(R_*(X))/K(U)$ and $\xi^* = K(R^*(X))/K(U)$ respectively, where K(S) denotes
the cardinality of S, which is assumed finite for simplicity. Obviously,
$0 \le \xi_* \le \xi^* \le 1$. The qualities of the lower and upper approximations, $\xi_*$ and $\xi^*$,
form the operator interval $[\xi_*, \xi^*]$. Obviously, it is also a subset of [0,1].
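
As an illustration of the definitions above, the following minimal sketch computes the lower and upper approximations and the resulting operator interval for a small finite universe. The partition, the target set and all function names are hypothetical and chosen only for this example; the equivalence relation is represented by its list of equivalence classes.

```python
from fractions import Fraction

def approximations(classes, target):
    """Return (lower, upper) approximations of `target` for a partition `classes`."""
    lower, upper = set(), set()
    for block in classes:
        if block <= target:   # the whole equivalence class lies inside X
            lower |= block
        if block & target:    # the equivalence class has a common element with X
            upper |= block
    return lower, upper

def operator_interval(universe, classes, target):
    """Qualities of lower and upper approximation: K(lower)/K(U), K(upper)/K(U)."""
    lower, upper = approximations(classes, target)
    n = len(universe)
    return Fraction(len(lower), n), Fraction(len(upper), n)

# Hypothetical data: 8 objects, 4 equivalence classes, one target set X.
U = set(range(1, 9))
partition = [{1, 2}, {3, 4, 5}, {6}, {7, 8}]
X = {1, 2, 3, 6}
xi_low, xi_up = operator_interval(U, partition, X)
print(xi_low, xi_up)
```

Exact fractions are used so that the inequality between 0, the two qualities and 1 can be checked without rounding effects.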

Definition 1. Let $\xi_*, \xi^*$ be the operators and $t \in [0,1]$ be the truth value of the
formulas in ORL. Then

(1) Composite operation of operators: $(\xi_* + \xi^*)/2$, where + and / are the
addition and division symbols of arithmetic respectively;

(2) Negation of truth values: $\neg t = 1 - t$, where - is the subtraction
symbol of arithmetic.

Definition 2. Let P be an n-place predicate symbol, $\xi \in [\xi_*, \xi^*]$ be an operator, and
$P(x_1, \ldots, x_n)$ be an atom of first order logic (FOL); then $\xi P$ is a rough atom of
ORL.

Definition 3. Let $\xi \in [\xi_*, \xi^*]$ be a rough operator; then the formulas of ORL are
defined recursively as follows:

(1) An ORL atom is a formula in ORL;

(2) If W and $W_1$ are formulas in ORL, then $\neg W$, $W \vee W_1$, $W \wedge W_1$,
$W \to W_1$ and $W \leftrightarrow W_1$ are formulas in ORL;

(3) If W is a formula of ORL and x is a free variable in W, then $(\forall x)W(x)$, $(\exists x)W(x)$,
$(\xi\forall x)W(x)$ and $(\xi\exists x)W(x)$ are formulas in ORL;

(4) The formulas obtained by applying (1)-(3) a finite number of times are the
formulas in ORL.

Definition 4. Let A=(U,R) be an approximation space, and let $I_R$ and $u_R$ be a rough
interpretation of the formulas and a rough assignment to individual variables in
the formulas in ORL respectively. The term assignment $T_{I_R u_R}$ corresponding to
$I_R$ and $u_R$ is a map from terms to objects such that

(1) If the term $\tau$ is an object constant symbol occurring in W, then
$T_{I_R u_R}(\tau) = I_R(\tau) = e$, where e is an entity in U;

(2) If the term $\tau$ is a variable symbol occurring in W, then $T_{I_R u_R}(\tau) = u_R(\tau) = c$,
where c is a constant in U;

(3) If the term $\tau$ is an n-place function of the form $\pi(\tau_1, \ldots, \tau_n)$, $I_R(\pi) = g$ and
$T_{I_R u_R}(\tau_i) = x_i$, then $T_{I_R u_R}(\tau) = g(x_1, \ldots, x_n)$, where g is a map from $U^n$ to U;

(4) If the term $\tau$ is an n-place predicate of the form $p(\tau_1, \ldots, \tau_n)$, $I_R(p) = P$ and
$T_{I_R u_R}(\tau_i) = x_i$, then $T_{I_R u_R}(\tau) = P(x_1, \ldots, x_n)$, where P is a relation on U.

The predicates occurring in the formulas of the logic are viewed as relations. For
any relation R, we can find the lower approximation $R_*(X)$ and the upper
approximation $R^*(X)$ with respect to R. Hence, the truth values of the formulas
in ORL can be computed by mathematical formulas.

The truth value of a formula W, $T_{I_R u_R}(W)$, is uniquely determined through the
following definition.

Definition 5. Let W and $W_1$ be formulas in ORL, and let $I_R$ and $u_R$ be an interpretation
and an assignment of the formulas respectively; then

(1), (2) [...]

(3) $T_{I_R u_R}(W \vee W_1) = \max\{T_{I_R u_R}(W), T_{I_R u_R}(W_1)\}$;

(4) $T_{I_R u_R}(W \wedge W_1) = \min\{T_{I_R u_R}(W), T_{I_R u_R}(W_1)\}$;

(5) $T_{I_R u_R}((\forall x)W(x)) = \min\{T_{I_R u_R}(W(x_1)), \ldots, T_{I_R u_R}(W(x_n))\}$,
where $x_i \in U$ and U is assumed finite.
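
The evaluation rules can be sketched in a few lines of code. The sketch below assumes the usual many-valued reading in which disjunction takes the maximum, conjunction and universal quantification over a finite universe take the minimum, and negation is 1 - t as in Definition 1; the function names are hypothetical, and the items of Definition 5 that are not legible in the source are not modelled.

```python
def t_not(t):
    """Negation of a truth value, as in Definition 1: 1 - t."""
    return 1.0 - t

def t_or(a, b):
    """Truth value of a disjunction (assumed: maximum of the two values)."""
    return max(a, b)

def t_and(a, b):
    """Truth value of a conjunction (assumed: minimum of the two values)."""
    return min(a, b)

def t_forall(values):
    """Truth value of a universally quantified formula over a finite universe."""
    return min(values)

# Hypothetical truth values of W(x) for the objects x of a three-element universe.
instances = [0.75, 0.5, 1.0]
print(t_forall(instances), t_or(0.25, 0.5), t_and(0.25, 0.5), t_not(0.25))
```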



2 The OI-Resolution of ORL 

The truth values of the formulas are taken in [0,1], and resolution in ORL
with respect to the operator set $[\xi_*, \xi^*]$ is called OI-resolution of ORL.

Definition 6. Let $[\xi_*, \xi^*]$ be an operator set and [0,1] be the truth value set of
formulas in ORL. If $(\xi_* + \xi^*)/2 > 0.5$ and [...], then W is called
$[\xi_*, \xi^*]$-valid, in brief OI-valid; if $T_{I_R u_R}(W) < \xi_*$, then W
is called $[\xi_*, \xi^*]$-inconsistent, in brief OI-inconsistent.

For any formula in FOL, there is a set of clauses that is equivalent to the original
formula in that the formula is satisfiable iff the corresponding set of clauses is
satisfiable. By these properties, we have the following proposition in ORL.

Proposition. A formula W in ORL can be transformed into a Conjunctive Normal
Form (CNF):

$$W' = C_1 \wedge \ldots \wedge C_m$$

where $m \ge 1$ and each $C_i$, $i = 1, \ldots, m$, is a disjunction.

Definition 7. Let $\xi \in [\xi_*, \xi^*]$ be an operator; then $\xi L$ and $\neg\xi L$ are literals in ORL.
The former is a positive literal; the latter is a negative literal.

Definition 8. Let $\xi \in [\xi_*, \xi^*]$ and $(\xi_* + \xi^*)/2 > 0.5$; then $\xi L$ and $\neg\xi L$ is a
complementary pair of literals with respect to $[\xi_*, \xi^*]$ in ORL, in brief
OI-complementary literals. If $\xi =_R \xi_1$, where $=_R$ is the rough equality sign with
respect to the error $\theta$, then $\xi L$ and $\xi_1 L$ are called similar literals, for short
OI-similar literals.

The operator $\xi$ can only be moved in front of predicates and has no concern
with individual variables and terms within the predicates; hence the
unification algorithm in ORL is similar to that in FOL.

Definition 9. Let $C_1$ and $C_2$ be two clauses in ORL with no individual variable in
common, and let $\xi_1 L_1$ and $\xi_2 L_2$ be two literals in $C_1$ and $C_2$ respectively. If $C_1$ and
$C_2$ have a most general unifier (mgu) $\sigma$ in the algorithm of FOL, and $\xi_1 L_1 \circ \sigma$
and $\xi_2 L_2 \circ \sigma$ is a complementary pair of literals, where $\circ$ is the operation sign
of substitution composition, then the clause $(C_1^\sigma - \xi_1 L_1^\sigma) \cup (C_2^\sigma - \xi_2 L_2^\sigma)$ is called a binary
resolvent of $C_1$ and $C_2$ in ORL, denoted by $OI(C_1, C_2)$, where $\xi_i L_i^\sigma$ is a set of
OI-similar literals in $C_i$.

Resolution is sound in that any clause that can be derived from a set of
clauses using resolution is logically implied by that set of clauses in FOL.
Thus we have the following OI-resolution soundness theorem in ORL.

Theorem. Let S be a set of clauses in ORL. If there is a resolution derivation of
the clause C from the set S of clauses, then S logically implies C.

Proof. It is finished by simple induction on the length of the resolution derivation.
For the induction, we need to show only that any given resolution step is sound.
Suppose, then, that $C_1$ and $C_2$ are arbitrary clauses in ORL that resolve to
produce a new clause:

$$(C_1 - \{\xi_1 L_1, \ldots, \xi_n L_n\}) \circ \sigma \;\cup\; (C_2 - \{\neg\xi'_1 L'_1, \ldots, \neg\xi'_m L'_m\}) \circ \sigma \qquad (1)$$

where $\sigma$ is an mgu of $L_1, \ldots, L_n$ and $L'_1, \ldots, L'_m$; $\xi_i L_i$ ($i = 1, \ldots, n$) are OI-similar
literals; $\neg\xi'_j L'_j$ ($j = 1, \ldots, m$) are similar literals. To prove (1) to be sound, we
only need to show that

[...] (2)

is OI-valid for any $I_R$ and $u_R$.

Because $\sigma$ is an mgu of $L_1, \ldots, L_n$ and $L'_1, \ldots, L'_m$, we can set $L = L_i^\sigma = L_j'^\sigma$, and
[...]; thus (2) can be written as

$$(C_1^\sigma - \{\xi L\}) \cup (C_2^\sigma - \{\neg\xi L\}) \qquad (3)$$

By the inductive assumption, $C_1$ and $C_2$ are OI-valid; therefore, if $\xi L$
is OI-valid then $\neg\xi L$ is OI-inconsistent, thus $C_2^\sigma - \{\neg\xi L\}$ is OI-valid. Again, if
$\xi L$ is OI-inconsistent then $\neg\xi L$ is OI-valid, hence $C_1^\sigma - \{\xi L\}$ is OI-valid.
Therefore (3), namely $(C_1^\sigma - \{\xi L\}) \cup (C_2^\sigma - \{\neg\xi L\})$,
is OI-valid. Hence, the original resolvent of $C_1$ and $C_2$,

$$(C_1 - \{\xi_1 L_1, \ldots, \xi_n L_n\}) \circ \sigma \;\cup\; (C_2 - \{\neg\xi'_1 L'_1, \ldots, \neg\xi'_m L'_m\}) \circ \sigma,$$

is OI-valid. The proof is finished.



3 Conclusion 

This paper proposes the system of ORL. The operator interval $[\xi_*, \xi^*]$ is
constructed from the qualities of the lower and upper approximations based on rough
set theory, and it can be computed by mathematical formulas. In contrast, the operator
$\lambda$ in [4] is an artificially assigned degree of approximation, and its operator interval is
also imaginary. These show that ORL is different from other many-valued logics.
On the other hand, relations are used as propositions and predicates in ORL,
while the relations are not easily defined; this may be the complexity of ORL.
Abstract. This paper addresses the extraction of symbolic knowledge
from trained artificial neural networks. Specifically, for that purpose the 
so-called pedagogical approach is incorporated, where the trained net- 
work is used as an oracle when inducing the symbolic description. We 
present an essential extension of the Trepan algorithm proposed orig- 
inally by Craven and Shavlik [4] [5]. The crucial modification concerns 
the way of generating artificial training instances. The paper ends with 
an empirical verification of the proposed method on popular machine 
learning benchmarks and comparison with the original Trepan. 



1 Introduction 

It is commonly recognized, that artificial neural networks (ANNs) became in the 
last decade one of the most promising computational paradigms in Computer 
Science, especially in Artificial Intelligence (AI) and related disciplines. Besides 
being a subject of extensive theoretical studies, they often outperform other 
approaches in various practical applications, related to machine learning, pattern 
recognition, signal processing, to list only a few. 

Many controversies concerning ANNs were raised and discussed in the
literature already in the late 80's [2] [7]. One of the most important and still unsolved
problems related to ANNs is their opacity, also referred to as the black-box problem.
These qualifications refer to the inherent inability of ANNs to explain, in terms
of languages legible to human beings, decisions they make and knowledge they
possess. This is due to the distributed model of processing implemented by
networks, where the final decision is a resultant of many processing elements (PEs)
working simultaneously. The explanation is hidden in many synaptic connections 
between particular PEs. That shortcoming prevents ANNs from being used in 
many real-world domains, where explanation is an essential point, due to, for 
instance, importance of the decisions being made. 

Thus, from the beginning of 90’s, there has been an increasing interest in the 
literature in the topics related to the above-mentioned problem. The early pro- 
posals [9] [6] were mostly focused on the explanation ability of an ANN concerning 
justification of the decision taken by the network for a particular example. There
has also been some research concerning hybrid systems, like [19], where an ANN
is used to refine the pre-defined symbolic knowledge. However, most of the work
done in the area addresses the more general problem of extracting the knowledge
in symbolic form from an ANN, referred to in short as knowledge extraction. As a
language for the symbolic representation, decision rules [18] [8] [3] or decision
trees [5] are usually chosen.

This paper addresses the extraction of decision trees from ANNs by means of 
the so-called pedagogical approach. Specifically, we present an extension of the 
Trepan algorithm proposed originally by Craven and Shavlik [4]. The crucial 
modification concerns the way of generating artificial training instances. 



2 Decompositional and Pedagogical Methods 

In the literature, many different formulations of the problem have been proposed, 
however, the most suitable for the purpose of the considered research is the one 
given by Craven and Shavlik [3]: 

"Given a trained neural network and the examples used to train it, 

produce a concise and accurate symbolic description of the network" 

It is assumed in the above definition, that the set of training examples is 
accessible beside the trained network. Following that fact, we assume (which 
is not explicitly stated in the definition) that the symbolic description will be 
extracted from the network in context of the data set. Thus, we should not be 
interested in the knowledge which is irrelevant from the viewpoint of the train- 
ing examples (and, thus, for the considered real-world classification problem). 
Secondly, two criteria are taken into account when evaluating the resulting de- 
scription: conciseness (length), which should be minimized to ensure legibility, 
and fidelity with respect to the network, obviously maximized (expressed, for in- 
stance, as a percentage of decisions consistent with the network). Unfortunately, 
these criteria are usually conflicting: high fidelity often requires an extensive
description. Their relative importance depends on the final use of the extracted
knowledge; for instance, for explanation purposes, it is mostly the conciseness that matters.

In late 80’s and early 90’s, most of the proposals concerning the knowledge 
extraction from ANNs were based on the so-called decompositional paradigm,
where the description is built by decomposing the network into particular PEs 
and analyzing their weights (e.g. [10] [8]). The exploration of the network is here 
formulated in terms of a search problem. Their main deficiency is high computa- 
tional complexity, resulting from taking into account all (or many) combinations 
of signals coming into a particular PE. Moreover, to limit the search space, they 
usually discretize weights or signals of the network, which results in questionable 
fidelity of the obtained description. 

To cope with the above-mentioned problems, Craven and Shavlik introduced
in 1994 [3] a completely different approach to the knowledge extraction from 
ANN, which incorporates the so-called pedagogical principle (known also as an 
oracle-based approach). The main idea relies on employing some symbolic ma- 
chine learning algorithm (inducer), producing, for instance, decision rules or 
decision trees. In general, it is possible to incorporate some existing system, like 
C4.5 [14]. Then, knowledge extraction consists in the induction of a classifier, 
however, using data set which is fundamentally modified in comparison to the 
original set of examples used for ANN training. 
The above modifications are twofold and rely on using the trained network
as an oracle, i.e. on testing its answers on examples prepared in a way described
below. Firstly, for each example, its (original) class label is replaced by the
classification suggested for it by the network. This ensures the fidelity of the
resulting description with respect to the considered network. Secondly, the original
training set is enriched by additional, 'artificial', examples. The values of
condition attributes for an artificial example are generated at random according
to the probability distributions of particular attributes in the original data set.
The class label is induced using the network as an oracle, as is the case for the
original examples. The introduction of artificial examples allows for more
precise examination of the behavior of the network. Specifically, this technique 
is extremely useful for induction of decision trees, being a remedy for the well- 
known problem of rapidly decreasing number of cases supporting the splitting 
of tree nodes on its lower levels. With that extension, the inducer is able to 
supplement the original examples supporting the node with the artificial ones. 
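
The two modifications of the data set can be sketched as follows. This is only a minimal illustration under stated assumptions: the oracle is any callable standing in for the trained network, the attribute distributions are approximated by drawing each attribute independently from the values observed in the original data (an empirical marginal), and all function names are hypothetical.

```python
import random

def relabel_with_oracle(examples, oracle):
    """Replace each original class label by the label the network assigns."""
    return [(x, oracle(x)) for x, _ in examples]

def sample_from_marginals(examples, n, rng):
    """Draw n artificial inputs, each attribute independently from its observed values."""
    columns = list(zip(*[x for x, _ in examples]))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]

def build_oracle_dataset(examples, oracle, n_artificial, rng=None):
    """Original examples relabelled by the oracle plus oracle-labelled artificial ones."""
    rng = rng or random.Random(0)
    real = relabel_with_oracle(examples, oracle)
    artificial = [(x, oracle(x)) for x in sample_from_marginals(examples, n_artificial, rng)]
    return real + artificial

# Hypothetical toy data and a stand-in "network" that thresholds the first attribute.
data = [((0.1, 1.0), "x"), ((0.9, 2.0), "y"), ((0.4, 3.0), "x")]
toy_oracle = lambda x: "pos" if x[0] > 0.5 else "neg"
dataset = build_oracle_dataset(data, toy_oracle, 5, random.Random(1))
print(len(dataset))
```

The resulting list can be passed to any symbolic inducer; because every label comes from the oracle, the inducer learns the behavior of the network rather than the original concept.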

Thus, the main difference between decompositional and oracle-based methods
is that the latter consider the network as a whole. As a consequence, one avoids
to some degree an oversimplification (e.g. due to discretization) of the network,
which should lead to better fidelity of the induced description. The distributed
nature of the network processing is also preserved by the oracle-based method.
Moreover, the computational complexity of the approach is basically determined
only by the complexity of the inducer, which is usually polynomial. However, it
should be stressed that the pedagogical approach extracts only the processing
(behavior) of the network and omits its structure, while the decompositional
methods consider both those aspects.

In this paper we present a significant extension of the Trepan algorithm 
[4] [5], which is, in our opinion, the most interesting representative of the ped- 
agogical paradigm. The next sections outline the algorithm and the proposed 
extensions. 



3 The Trepan Algorithm 

The Trepan algorithm (TREes PArroting Networks), proposed originally in 
[3], is probably the most comprehensive representative of the oracle-based
approaches. It chooses decision trees as a language for a symbolic representation
of the extracted knowledge. A thorough description of the method can be found 
in [5]. The aim of this section is merely to outline it as an introduction to our 
work. 

As mentioned before, the main idea of Trepan is to view the knowledge 
extraction from a trained network as an inductive learning task. Since the chosen 
knowledge representation is a decision tree, the authors refer to popular tree 
induction algorithms, such as CART [1], ID3 [13], and C4.5 [14]. The remainder 
of this section presents the similarities and differences between conventional 
decision-tree inducers and Trepan. 

The decision-tree inducing methods proceed in general as follows. Based
on a set of training examples, the input space, spanned over the attributes, is
recursively partitioned in order to separate examples with different class labels.
Each inner node of the induced tree represents some partition of the input space
and each leaf is labelled by the class predicted for those examples which reach
this leaf. Every partition of the input space involves a test on the attribute
values. Such a test may be a condition placed on a single attribute (as in [14]).
In that case, a test on a symbolic (nominal) attribute with n possible values gives
n outcomes (child nodes), one for each value. A test on a continuous attribute
is specified by a threshold v on its values and results in 2 outcomes (one for
examples with the attribute's value < v, and the other for the remaining ones).
The tests originally used in Trepan were more complex M-of-N conditions, 
introduced to decision trees in [12]. 

Since the objective is to build compact yet accurate decision trees, choosing 
the best of all possible tests at each tree node is a nontrivial problem. Trepan, 
like many other tree inducers, uses information gain [13] as the criterion for
evaluating candidate tests. According to this entropy-based criterion, the test chosen
is the one which maximizes the information gained about the class labels of the examples.
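
For readers less familiar with the criterion, the following minimal sketch computes the information gain of a candidate test as the entropy of the class labels minus the weighted entropy of the labels in each outcome of the test. The data and the names `entropy` and `information_gain` are hypothetical; this is the textbook formulation, not code taken from Trepan.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, test):
    """Gain of `test` (a function from attribute vector to outcome) on labelled examples."""
    labels = [y for _, y in examples]
    branches = {}
    for x, y in examples:
        branches.setdefault(test(x), []).append(y)
    remainder = sum(len(b) / len(labels) * entropy(b) for b in branches.values())
    return entropy(labels) - remainder

# Hypothetical examples with one continuous attribute and two classes.
examples = [((0.2,), "a"), ((0.3,), "a"), ((0.7,), "b"), ((0.9,), "b")]
good = information_gain(examples, lambda x: x[0] < 0.5)   # separates the classes
poor = information_gain(examples, lambda x: x[0] < 0.25)  # leaves one branch mixed
print(good, poor)
```

A tree inducer would evaluate such a gain for every candidate threshold and pick the test with the largest value.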

Considering that, one major inconvenience of conventional decision-tree build- 
ing algorithms becomes apparent: the number of training examples available at 
a tree-node decreases with the depth of the tree. Thus, tests near the bottom 
of the tree may often be poorly chosen due to insufficient number of examples. 
Trepan overcomes this by randomly generating additional, artificial training 
examples when needed. In fact, additional training examples are generated for 
every tree node in which the number of original examples is less than S_min,
where S_min is a parameter of the algorithm. It is worth emphasizing that such
a solution would not be possible when learning a concept underlying some given
set of examples, since the class labels for artificial examples would be unknown;
in Trepan, however, the target function is the concept given by an ANN - the
oracle which is used to assign class labels to artificial examples as well as to the 
original ones. 

In generating additional training examples the authors of Trepan use a 
fairly simple approach based on modeling the marginal distributions of individual 
attribute values. Such an approach suffers from the fact, that it does not take into 
account possible dependencies among the attributes. The authors of Trepan 
are aware of that problem and try to overcome it by estimating the marginal 
distributions locally for the particular nodes as the decision tree is being built. 
However, this solution is not fully satisfactory. In our opinion the method used 
for generation of training examples has a major impact on the quality of the 
extracted trees, so it is worth further investigation. Thus, in the next section we 
propose an entirely new method, which should result in generation of artificial 
examples preserving the interdependencies among attributes. 



4 Attribute-Interdependency Preserving Method 

As already mentioned in Section 3, we are strongly convinced that the method 
of generating artificial training data is crucial. Ignoring the existence of depen- 
dencies among features often results in obtaining ‘nonsense’ examples, which do 
not have a sensible interpretation in the problem domain. This is particularly 
true for continuous features, since the examples are generated randomly from 





K. Krawiec, R. Slowinski, I. Szczesniak 



within a hypercube extending from the minimal to the maximal values of every 
attribute. The problem aggravates with the increase in space dimensionality, i.e. 
the number of the attributes. 

As a further consequence, the induced decision tree approximates the be- 
havior of the considered ANN also in those parts of the problem space where 
real-world examples are unlikely to be found. Taking into account that our aim 
is to build a concise, and thus legible, description of an ANN, it becomes obvious
that considering nonsense examples may lead to an unsatisfactory exploration of 
relevant parts of the problem space. Taking the above into account, we propose 
herein a simple heuristic, which regards the actual distribution of the instances 
in the problem space to produce artificial examples. 

In order to generate a single artificial training example a' at a specified tree- 
node N our algorithm, called hereafter Trepan+, employs the following steps: 

1. Randomly choose an example $a = (v_1, v_2, \ldots, v_m)$ from among those in the original
training set which have reached N.

2. For every attribute $i = 1, 2, \ldots, m$ generate a random value $v'_i$ according
to a Gaussian distribution around the value $v_i$ (the variances $\sigma_i^N$ of the
distributions are estimated individually for each attribute i and the examples in
node N).

3. Use the oracle (i.e. query the ANN) to determine the class label for the
newly generated example $a' = (v'_1, v'_2, \ldots, v'_m)$.
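
The three steps above can be sketched as follows. This is an illustration only: the text does not say exactly how the per-node variances are estimated, so the sketch assumes the population standard deviation of each attribute over the examples in the node, multiplied by a hypothetical `scale` factor; the oracle is any callable standing in for the ANN.

```python
import random
import statistics

def attribute_sigmas(node_examples, scale=1.0):
    """One standard deviation per attribute, estimated from the examples in the node.

    The estimator and the `scale` factor are assumptions made for this sketch.
    """
    columns = zip(*node_examples)
    return [scale * statistics.pstdev(col) for col in columns]

def generate_artificial(node_examples, oracle, rng, scale=1.0):
    """Create one artificial example at a node and label it with the oracle."""
    sigmas = attribute_sigmas(node_examples, scale)
    base = rng.choice(node_examples)                                   # step 1
    perturbed = tuple(rng.gauss(v, s) for v, s in zip(base, sigmas))   # step 2
    return perturbed, oracle(perturbed)                                # step 3

# Hypothetical node contents: the second attribute is constant in this node.
node = [(0.25, 1.0), (0.75, 1.0), (0.25, 1.0)]
toy_oracle = lambda x: int(x[0] > 0.5)
example, label = generate_artificial(node, toy_oracle, random.Random(0), scale=0.1)
print(example, label)
```

Because each artificial example starts from a real one, attribute combinations that never occur together in the data are unlikely to be produced as long as the perturbation is small.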

Thus, an artificial example a' is being created by introducing perturbations 
into values of attributes of a randomly chosen original example a. We claim that 
such a technique preserves the dependencies among the attributes better than 
the technique using marginal distributions. As an illustrative example, let us 
consider a data set described by a pair of attributes (x, y), where the
examples constitute 3 groups of equal size. The attribute values are (0.25,0.25),
(0.25,0.75), and (0.75,0.25) for the first, second and third group of examples,
respectively. As can be seen in Fig. 1, the original Trepan generates artificial
examples based on marginal distributions, which results in an extra group around
the values (0.75,0.75), which does not exist in the original data set. Moreover,
the probability around (0.25,0.25) is overestimated at the expense of the two
remaining groups. In contrast, Trepan+ is free of these deficiencies.

Our method uses the well-known Box-Muller algorithm for generating random
numbers according to a Gaussian distribution, and thus has another major
advantage of being efficient in comparison to, e.g., the Monte Carlo technique used
in [17]. This is particularly important considering the usually large number of
examples generated at every tree node.
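
For reference, the basic form of the Box-Muller transform is sketched below: two independent uniform numbers are mapped to two independent Gaussian numbers. The function name and parameters are hypothetical, and the sketch makes no claim about which variant of the algorithm was actually implemented.

```python
import math
import random

def box_muller(mu, sigma, rng):
    """Return two independent samples from a Gaussian with mean mu and std sigma."""
    u1 = 1.0 - rng.random()   # lies in (0, 1], so the logarithm is always defined
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    z0 = radius * math.cos(2.0 * math.pi * u2)
    z1 = radius * math.sin(2.0 * math.pi * u2)
    return mu + sigma * z0, mu + sigma * z1

rng = random.Random(42)
samples = [v for _ in range(10000) for v in box_muller(0.25, 0.05, rng)]
print(sum(samples) / len(samples))
```

Each call costs one logarithm, one square root and two trigonometric evaluations, which is why the method is cheap even when many examples are generated per node.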

5 Empirical Evaluation 

The main aim of the experiment was to compare the fidelity and size of trees gen- 
erated using the original Trepan and Trepan+. To obtain comparable results, 
the computational experiments have been carried out on well-known reference 
data sets. As the ANNs prove their usefulness especially on continuous (quanti- 
tative, real- valued) attributes, and extraction of knowledge for such attributes is 

Fig. 1. Probability distributions of artificial examples for both compared methods.



in general harder than for nominal ones, we focused our attention on the former 
type of attributes. Thus, from the Repository of Machine Learning Databases [11] 
a few domains composed solely of quantitative attributes have been chosen: Iris 
(Fisher's iris plant database), Glass (glass identification database), Pima (Pima
Indians diabetes database), and Bupa (liver disorders). The computations have 
been performed also for the Busses data set, collected in our environment [16]. 
Table 1 outlines the profiles of the data sets. 

Standard feed-forward multi-layer perceptrons with a sigmoidal transfer function
were the subject of the knowledge extraction. However, to speed up the training,
instead of backpropagation, we used the much more efficient RPROP algorithm [15]
with a threshold on the MSE as the stopping condition. The number of hidden
layers and their sizes have been estimated experimentally to maximize the
accuracy of classification. The topologies of the networks are shown in Table 1. 

The experiments have been carried out in the cross-validation framework, 
to ensure the robustness of results. Each of the 10 cross-validation iterations is 
composed of the following steps: (i) network training on the training set, (ii) 
network verification on the testing set, and (iii) extraction of the symbolic
representation in context of the training set, using the original Trepan and our modified
version Trepan+ with improved generation of artificial examples. Each resulting
description (decision tree) is then verified on the testing set with respect
to the fidelity to the ANN it has been extracted from, and with respect to the
classification accuracy. It should be stressed that both Trepan and Trepan+
work on precisely the same networks.
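
The two measures used in the verification step can be written down directly. In the sketch below, fidelity is the percentage of test examples on which the tree and the network give the same decision, and accuracy is the percentage of test examples classified correctly; the function names and the toy classifiers are hypothetical.

```python
def fidelity(tree_predict, network_predict, inputs):
    """Percentage of inputs on which the tree agrees with the network."""
    agree = sum(tree_predict(x) == network_predict(x) for x in inputs)
    return 100.0 * agree / len(inputs)

def accuracy(predict, inputs, labels):
    """Percentage of inputs whose predicted class equals the true class."""
    correct = sum(predict(x) == y for x, y in zip(inputs, labels))
    return 100.0 * correct / len(inputs)

# Hypothetical stand-ins for a trained network and a tree extracted from it.
network = lambda x: x > 0
tree = lambda x: x > 1
test_inputs = [-1, 0.5, 2, 3]
test_labels = [False, True, True, True]
print(fidelity(tree, network, test_inputs), accuracy(tree, test_inputs, test_labels))
```

Note that fidelity never looks at the true labels, so a tree can have high fidelity and low accuracy if the network itself is inaccurate.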

The parameters of both algorithms have been adjusted to obtain optimal
and similar (as far as it was possible) fidelity with respect to the network. The
generation of M-of-N conditions [12] has been disabled in the original Trepan,
as it did not improve the results and would have made the comparison
between the two algorithms difficult, because Trepan+ does not implement that feature.

Table 1 presents the above-mentioned results of the experiments, with fidelity,
tree size, and test set accuracy averaged over 10 iterations of cross-validation.
Table 1. Data sets, ANN architectures, and properties of extracted decision trees.

Data set                                 Iris     Glass    Pima     Bupa     Busses
Examples/Attributes/Classes              150/4/3  214/9/6  768/8/2  345/6/2  76/8/2
Network architecture                     4-5-3    9-10-6   8-10-2   6-5-2    8-5-2
Test set fidelity with       Trepan      93.3     56.3     81.0     84.9     96.1
respect to ANN [%]           Trepan+     93.3     71.5     81.2     84.1     96.1
Tree size (no. of            Trepan      3.8      7.2      13.1     5.1      1.8
internal nodes)              Trepan+     2.5      6.1      4.4      6.5      1.0
Test set accuracy [%]        ANN         95.3     65.0     73.6     67.5     98.7
                             Trepan      92.7     49.4     71.7     64.1     94.7
                             Trepan+     96.0     64.5     73.8     64.1     94.7



6 Conclusions 

The results of computational evaluation confirm basically the main thesis of the 
study. As it is shown in Table 1, both the algorithms yield decision trees of com- 
parable fidelity with respect to the ANN, however, the knowledge extracted by 
Trepan+ is in most cases much more compact, and thus more comprehensible.
That advantage is especially impressive in case of the Pima set, where the trees
built by Trepan+ are on average about three times smaller than those built by
Trepan (4.4 vs. 13.1 internal nodes), preserving at the same time the fidelity of 
description. The trees built by both algorithms for the Glass data set are similar 
in size, however, we were unable to force the original Trepan to achieve fidelity 
comparable to Trepan+. The only exception from that tendency is the Bupa 
data set, where Trepan yields smaller trees with slightly better fidelity. This 
may be due to more sophisticated tree induction strategy used by Trepan. It 
should also be mentioned that the trees built by Trepan+, besides being smaller,
obtain better accuracy of classification. 

Thus, the above experiment shows, that taking into account the interde- 
pendencies between attributes when generating the artificial examples improves 
significantly the quality (especially size) of the resulting description. Ignoring 
those dependencies may lead to decision trees which are in some part redun- 
dant, as they are partially founded on examples which have no interpretation in 
terms of the domain knowledge. 
Abstract. We present a method for processing EEG signals
by means of wavelets, rough set based algorithms and neural
networks. The hybrid approach makes the problem of discerning
between posttraumatic epilepsy and other causes of epilepsy
solvable. Experimental results show that the proposed
approach is promising.



1 Introduction 

The problem of epilepsy diagnosis is one of the important problems in EEG analysis.
Research is carried out in two directions. The first one is related to the detection
of characteristic patterns in epileptic EEG. This automated analysis of signals
helps physicians to look quickly through large amounts of data. Some results of
research in this direction are reported in [8]. They are devoted to differentiating
single-focal and multifocal epilepsy. Another approach is based on the search for
relevant features which could support diagnosis. Our paper is relevant to the
second approach. In this paper we propose to use wavelet analysis and to
apply rough set methods or neural network based methods to the data
obtained from this analysis. Wavelets have already proved their efficiency in
the field (see [9]).

The objective of this paper is to solve the following problem for EEG signals. 
There are two groups of children suffering from epilepsy. The first, according to
the clinical assessment, suffers from posttraumatic epilepsy (denoted B) and numbers 11.
Epilepsy of children in the second group (denoted A) is due to other causes and
there are 25 children in this group. There are also two additional groups probably
belonging to the basic ones. For the purpose of this research, they were added
to the appropriate basic groups. For every patient we have at hand an EEG score
of 21 electrodes, 2.5 seconds long, sampled at a frequency of 102.4 Hz. The problem is:
Can one, using only these EEG scores and no other clinical information, distinguish
between posttraumatic epilepsy and the other group?

The experiments show that hybrid methods using wavelets in combination
with rough set methods and neural networks can give satisfactory results.
The initial data has 5376 real-valued attributes (21 electrodes x 256 samples) for
44 objects. With such a large number of attributes taking many values, direct
application of methods for decision rule generation cannot give a satisfactory
solution. Hence, in preprocessing, compression of data by wavelet methods is used.
Data obtained from wavelet analysis are used as input for rough set or neural
network methods. In many cases hybrid methods can give better results (see for
example [10]).

The paper is organized as follows: In Section 2 the basis of applied wavelet 
analysis is presented. 

In Section 3 we present methods used for further processing of the data. After
their application the data are treated with two classification methods. These are:

(i) the rough set methods implemented in the RSES library (see [7], [1]) by the
team supervised by A. Skowron at the Institute of Mathematics of Warsaw
University, and

(ii) methods based on artificial neural networks, by means of the system
NeuralWorks Professional II/Plus.

We have used a cross-validation scheme. For rough set methods the classification
was done in a 5-fold cross-validation scheme, while for ANN a 2-fold
cross-validation scheme was applied. Dividing the data into two subgroups for ANN
is a better setup than 5-fold cross-validation, which was confirmed experimentally.
Both methods analyze one subgroup and then classify the other.

Section 4 summarizes our results. 

Let us note that a small cardinality of group B makes mean efficiency useless, 
so we give classification efficiency for both groups. As group B is more difficult 
to classify, we interpret the result as best, if it is best for group B. 
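
The remark about mean efficiency can be illustrated with a minimal Python sketch. The function name `group_efficiency` and the toy labels are illustrative assumptions and are not taken from the experiments; the sketch only shows how a classifier that never recognizes the small group can still reach a high mean efficiency.

```python
def group_efficiency(true_labels, predicted_labels):
    """Return the fraction of correctly classified objects for each group."""
    totals, hits = {}, {}
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] = totals.get(truth, 0) + 1
        hits[truth] = hits.get(truth, 0) + (truth == guess)
    return {group: hits[group] / totals[group] for group in totals}


# Toy data: 8 objects in group A, 2 in group B, everything classified as A.
truth = ["A"] * 8 + ["B"] * 2
guess = ["A"] * 10
per_group = group_efficiency(truth, guess)
mean = sum(t == g for t, g in zip(truth, guess)) / len(truth)
print(per_group, mean)
```

Reporting the two group efficiencies separately, as done in the tables of this paper, makes such a failure on group B visible.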

2 Wavelet analysis of the signal 

We construct a wavelet in the way presented in [5]. For a certain sequence of coefficients $(c_n)$ we build a function $\varphi$ satisfying the so-called scaling equation

$$\varphi(x) = \sum_{n \in \mathbb{Z}} c_n \varphi(2x - n).$$

Then we define a function

$$\psi(x) = \sum_{n \in \mathbb{Z}} (-1)^n c_{-n+1} \varphi(2x - n).$$

For an appropriate choice of $(c_n)$ the function $\psi$ is a wavelet, i.e. the functions $\psi_{jk}(x) = 2^{j/2} \psi(2^j x - k)$ with $j, k \in \mathbb{Z}$ form an orthonormal basis for $L^2(\mathbb{R})$. This means that the expansion $\sum_{j,k} \langle f, \psi_{jk} \rangle \psi_{jk}$ gives a perfect reconstruction of $f$.

Moreover, rejecting the coefficients of the expansion satisfying $|\langle f, \psi_{jk} \rangle| < p$ we obtain a function differing from $f$ by $C_p$ in the $L^2$-norm. Therefore, to compress the signal $f$ we replace it with

$$\tilde f = \sum_{(j,k) \in A_p} \langle f, \psi_{jk} \rangle \psi_{jk},$$

where $A_p = \{(j,k) \in \mathbb{Z}^2 : |\langle f, \psi_{jk} \rangle| > p\}$. These facts are connected with the fact that wavelets form unconditional bases for the function spaces considered [6]. These bases are optimal for compression, as is proved in [3].

For the purpose of this paper we choose the Daubechies wavelet of the 5th order [2]. It is a function of finite regularity with compact support contained in the interval [0, 9]. Figs. 1 and 2 present a scaling function and a wavelet, respectively.

Fig. 3 presents the dependence between the compression ratio and the reconstruction error $\|f - \tilde f\|_{L^2}$ between the signal $f$ and its compressed version $\tilde f$.

We have chosen for our purposes the threshold $p_0 = 1,0$, which ensures a sufficient compression ratio and signal quality. Fig. 4 shows the accuracy of wavelet compression.
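
Compression by thresholding in an orthonormal wavelet basis can be sketched in a few lines of Python. To keep the example self-contained it uses the Haar wavelet, not the Daubechies wavelet of the 5th order chosen in this paper; the function names and the toy signal are assumptions made for illustration only. The sketch also thresholds the single scaling coefficient together with the wavelet coefficients.

```python
import math


def haar_forward(signal):
    """Orthonormal Haar decomposition of a sequence whose length is a power of 2."""
    coeffs = list(signal)
    n = len(coeffs)
    while n > 1:
        half = n // 2
        avg = [(coeffs[2 * i] + coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        det = [(coeffs[2 * i] - coeffs[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        coeffs[:n] = avg + det
        n = half
    return coeffs


def haar_inverse(coeffs):
    """Inverse of haar_forward."""
    out = list(coeffs)
    n = 1
    while n < len(out):
        merged = []
        for a, d in zip(out[:n], out[n:2 * n]):
            merged.append((a + d) / math.sqrt(2))
            merged.append((a - d) / math.sqrt(2))
        out[:2 * n] = merged
        n *= 2
    return out


def compress(signal, p):
    """Keep only coefficients with absolute value greater than p."""
    kept = [c if abs(c) > p else 0.0 for c in haar_forward(signal)]
    return haar_inverse(kept), sum(1 for c in kept if c != 0.0)


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, n_kept = compress(signal, 2.0)
error = l2_norm([a - b for a, b in zip(signal, approx)])
print(n_kept, error)
```

Because the basis is orthonormal, the $L^2$ error of the compressed signal equals the norm of the rejected coefficients, which is what makes a plot of error against compression ratio easy to produce.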



3 Data processing 

3.1 Frequential analysis

Wavelet analysis of the signal resulted in a quite small number of coefficients $\langle f, \psi_{jk} \rangle$ with absolute value bigger than $p_0 = 1,0$, namely about 15. However, for the application of discretization procedures it is important to obtain, for every patient and electrode, the same coefficients that could serve as attributes. The straightforward procedure of taking all coefficients with absolute value bigger than $p_0 = 1,0$ for at least one patient and one electrode would result in 86 × 21 = 1806 attributes for every patient, which is too many. Therefore, we have decided to apply frequential analysis. Frequential analysis consists in choosing the wavelet coefficients $\langle f, \psi_{jk} \rangle$ with indices $(j, k)$ such that the number of their occurrences with absolute value greater than $p_0$, over all electrodes and all patients, is greater than a certain threshold M.

For the threshold M=200 we obtained 29 coefficients, which form our basic data D. For any signal these 29 coefficients are considered, and those with absolute value smaller than $p_0$ are replaced by 0. This data gives results of large dispersion, which can be seen from the efficiency of classification (see Table 1). ANN even failed to solve this problem because of its time complexity.
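
The selection step of frequential analysis can be sketched as follows. This is a minimal illustration under assumed data structures: each signal (one patient, one electrode) is represented as a dictionary mapping an index pair (j, k) to its coefficient, and the names `frequent_indices` and `to_attributes` are hypothetical.

```python
from collections import Counter


def frequent_indices(signals, p0, m):
    """Indices (j, k) whose coefficient exceeds p0 in more than m signals."""
    counts = Counter()
    for coeffs in signals:
        for index, value in coeffs.items():
            if abs(value) > p0:
                counts[index] += 1
    return sorted(index for index, count in counts.items() if count > m)


def to_attributes(coeffs, selected, p0):
    """Attribute vector of one signal; coefficients smaller than p0 become 0."""
    row = []
    for index in selected:
        value = coeffs.get(index, 0.0)
        row.append(value if abs(value) >= p0 else 0.0)
    return row


# Toy data: three signals, two candidate coefficients.
signals = [
    {(1, 0): 2.0, (2, 1): 0.5},
    {(1, 0): -1.5, (2, 1): 3.0},
    {(1, 0): 1.2},
]
selected = frequent_indices(signals, p0=1.0, m=2)
print(selected, [to_attributes(s, selected, 1.0) for s in signals])
```

Raising the threshold M shrinks the attribute set, which is the trade-off explored in the experiments with M=200 and M=660.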

In the next experiments (no. 3-5) we took the 4 most frequent coefficients, with M=660.

The last case (experiments no. 6-7) is one of the most complicated. We performed it to find features of the signal that are translated in time. To do so we considered 13 coefficients from 5 levels (5 different values of j), located consecutively in time (subsequent values of k). These coefficients have M ≈ 600. As input data we took the averages of these 13 coefficients within levels.

Results of these experiments are presented in Table 1. The best efficiency was obtained in the 4th experiment, with M=660 analyzed by RSES with dynamic scaling: 75% (A) and 56% (B).

No.  Experiment                     Efficiency (A)  Efficiency (B)
1    M=200 (RSES)                   60-71 %         27-58 %
2    M=200 (ANN)                    *               *
3    M=660 (RSES)                   80-82 %         33-38 %
4    M=660 (RSES dynamic scaling)   75 %            56 %
5    M=660 (ANN)                    66 %            54 %
6    means 600 (RSES)               68 %            40 %
7    means 600 (ANN)                63 %            54 %

Table 1. Application of frequential analysis - the efficiency of classification

3.2 Global cuts

The basic data D was analyzed by means of the RSES procedures for discretization [4]. Interestingly, only 4 cuts are enough to split all data D into subsets of decision classes. Therefore, the efficiency of this classification is 100% (A) and 100% (B). We claim that these cuts may be relevant features of EEG. The following experiment confirms it. We translated the cuts from their attributes to the corresponding attributes of the next electrode. Explicitly, we changed the cuts (no. of coefficient, no. of electrode, threshold) to (no. of coefficient, no. of electrode+1, threshold). This resulted in a good efficiency of classification, namely 93% (A) and 73% (B).
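
The role of the cuts and of their translation can be sketched as follows. The data layout (a dictionary from (coefficient, electrode) to a value), the function names and the wrap-around from electrode 21 back to electrode 1 are assumptions made for this illustration; the text does not say how the last electrode was handled.

```python
def cell(obj, cuts):
    """Position of an object relative to every cut (coefficient, electrode, threshold)."""
    return tuple(obj[(coef, electrode)] > threshold for coef, electrode, threshold in cuts)


def separates_classes(objects, decisions, cuts):
    """True if objects falling into the same cell always share the same decision."""
    seen = {}
    for obj, decision in zip(objects, decisions):
        if seen.setdefault(cell(obj, cuts), decision) != decision:
            return False
    return True


def translate(cuts, n_electrodes=21):
    """Move every cut to the next electrode (wrap-around is an assumption)."""
    return [(coef, electrode % n_electrodes + 1, threshold)
            for coef, electrode, threshold in cuts]


# Toy data: one coefficient measured on two electrodes for two patients.
objects = [{(0, 1): 0.2, (0, 2): 0.9}, {(0, 1): 1.4, (0, 2): 0.1}]
decisions = ["A", "B"]
cuts = [(0, 1, 1.0)]
print(separates_classes(objects, decisions, cuts), translate(cuts))
```

A set of cuts for which `separates_classes` holds on the whole population corresponds to the 100% efficiency reported above; the translated cuts are then evaluated as an ordinary classifier.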

3.3 Best cuts 

Our next approach was to scale the data before analysis to ensure its greater stability. Using discretization procedures from the RSES library, 20, 50 and 100 cuts discerning the largest possible number of pairs of objects in the population were generated. The data was scaled using these cuts and then analyzed by the RSES system and ANN. Results are presented in Table 2.



No.  Experiment                                     Efficiency (A)  Efficiency (B)
1    20 best cuts (RSES)                            75 %            50 %
2    20 best cuts (ANN)                             62 %            52,5 %
3    50 best cuts (RSES)                            84 %            64 %
4    50 best cuts (ANN)                             73 %            46,5 %
5    100 best cuts (RSES)                           100 %           3 %
6    100 best cuts (RSES with additional scaling)   85 %            69 %
7    100 best cuts (ANN)                            89,5 %          47 %

Table 2. Application of best cuts - the efficiency of classification



The results are good for smaller numbers of cuts. The best result was obtained for 50 best cuts with RSES (3rd experiment): 84% (A) and 64% (B). The result of the 6th experiment is robust: 85% (A) and 69% (B). However, one should remember that in this experiment the data was scaled twice.

3.4 Pattern identification

This experiment comes from medical intuition about the problem. It consists in seeking a pattern, or elements of a pattern, in the EEG record. We have chosen four triples $(j, k, p_{jk})$ and the data was scaled by putting 1 if $\langle f, \psi_{jk} \rangle > p_{jk}$ and 0 otherwise. For each electrode we count the ones, and their number is one of 21 attributes. Thus, we compare the wavelet components of the signal with the pattern, and the number of ones measures the quality of the match. These four triples correspond to the pattern $\sum_{j,k} p_{jk} \psi_{jk}$ given in Fig. 5. The classification quality in this case was 68,5% (A) and 67% (B) by means of ANN.
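
The scaling used for pattern identification amounts to counting, for every electrode, how many pattern thresholds are exceeded. A minimal sketch, assuming that each electrode is given as a dictionary from (j, k) to a coefficient and that the pattern is a dictionary from (j, k) to its threshold (both layouts and the name `pattern_attributes` are hypothetical):

```python
def pattern_attributes(patient, pattern):
    """One attribute per electrode: the number of pattern thresholds exceeded."""
    return [
        sum(1 for index, threshold in pattern.items()
            if coeffs.get(index, 0.0) > threshold)
        for coeffs in patient
    ]


# Toy data: a pattern of two triples and a patient with two electrodes.
pattern = {(3, 1): 1.0, (3, 2): 0.5}
patient = [{(3, 1): 1.2, (3, 2): 0.7}, {(3, 1): 0.4}]
print(pattern_attributes(patient, pattern))
```

With four triples and 21 electrodes, each patient is thus described by 21 attributes with values between 0 and 4.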

4 Conclusions 

1. The best results are: 75% (A) and 56% (B) for frequential analysis, 84% (A) and 64% (B) for the best cuts approach, 85% (A) and 69% (B) for the best cuts approach with additional scaling, and 68,5% (A) and 67% (B) for pattern identification. The obtained results allow us to suggest that the described hybrid method can be treated as a promising tool for classification of objects. However, further work in this direction is needed.

2. The comparison of results shows that improving classification in group B yields worse results in group A. One possible cause is that part of the data from group B does not include the information necessary for efficient classification.

3. The whole population can be divided into subsets of decision classes using only 4 cuts. Hence the decision classes can be defined by these 4 cuts. Our experiments with translated cuts show that they carry real information from the EEG.

4. Further work on rough set methods should make it possible to discover features of signals that discriminate between the groups. Wavelet analysis may then be concentrated on preprocessing that denoises efficiently, to make these features detectable by rough set methods.

5. The analysis of attribute dissimilarities for patients from different decision classes is a promising field of investigation.

6. Another direction of further research is the optimization of wavelets with respect to finding essential signal features, compression ratio or reconstruction accuracy. It requires interpolating wavelets. This enables us to use genetic algorithms for wavelet optimization.

Abstract. The paper addresses the problem of reduct generation, one of the key issues in rough set theory. A considerable speed-up of computations may be achieved by decomposing the original task into subtasks and executing these as parallel processes. This paper presents an effective method of such a decomposition. The presented algorithm is an adaptation of the reduct generation algorithm based on the notion of the discernibility matrix. The practical behaviour of the parallel algorithm is illustrated with a computational experiment conducted for a real-life data set.



1 Introduction 

This paper addresses the problem of reduct computation in information systems. 
The reduct is a notion that has been given much attention in numerous papers 
within the Rough Set community [4, 5, 6, 7, 9]. The idea of reduct and attribute 
reduction in information tables is, in general, related to a broader problem of 
feature selection, which has been the focus of many papers in the area of Machine 
Learning (see [1] for a comprehensive review of feature selection methods). A 
successful application of rough set reducts has been reported e.g. in [3]. 

One of the most challenging issues related to reducts is the problem of reduct 
generation. This is because generating reducts is a computationally complex task. 
The problem of generating a minimal reduct has been proved to be NP-hard in 
[7]. This result directs the line of research towards different methods of handling 
the problem of computational complexity. As a result, the reduct generating 
procedures employ both exact and heuristic methods. 

This paper deals with exact methods. The forthcoming sections provide moti- 
vation for parallelization of reduct generation and introduce a parallel algorithm 
for the computation of all exact reducts. The algorithm is based on the decom- 
position of an exact algorithm from [7]. Because introducing the framework for 
formal definition of reducts is impossible due to the paper size restrictions, the 
reader is referred to [4, 5, 7, 9]. 

The presented methodology is certainly not the only existing approach to 
the problem of decomposition in reduct generation. It focuses on a single data 
set and the parallelization is introduced mainly to speed up the computations 
for this data set. An earlier approach from [4] considers information systems in
which the data are distributed among several locations. The process of reduct
computation may then proceed independently in all of those locations and the
results may be combined together to produce the final set of reducts.

The rest of the paper is organized as follows. Section 2 presents a sequential reduct generation algorithm. Section 3 introduces the parallelization of this algorithm. Section 4 presents and discusses the results of an experiment aimed at a practical evaluation of the parallel algorithm. The last section draws attention to some unsolved problems and outlines directions of future research on the subject.

2 The Sequential Implementation of the Algorithm 

The original algorithm from [7] is based on the notion of the discernibility matrix. The result of the algorithm is the set of all exact reducts. The main notion of the algorithm, the discernibility matrix, is defined as follows. Given the information table $IT = \langle U, Q, V, \delta \rangle$, $|U| = N$, the discernibility matrix is defined as the matrix $[C_{ij}]$, $i = 1..N$, $j = 1..N$, where:

$$C_{ij} := \{q \in Q : \delta(u_i, q) \neq \delta(u_j, q)\}, \text{ for each pair } i, j.$$
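
The definition can be made concrete with a short Python sketch. The representation of objects as dictionaries from attribute names to values, the function name and the four-object toy table are assumptions made for illustration.

```python
def discernibility_matrix(objects, attributes):
    """Matrix whose entry [i][j] is the set of attributes on which objects i and j differ."""
    n = len(objects)
    return [
        [frozenset(q for q in attributes if objects[i][q] != objects[j][q])
         for j in range(n)]
        for i in range(n)
    ]


# Toy information table with three attributes and four objects.
attributes = ["a", "b", "c"]
objects = [
    {"a": 0, "b": 0, "c": 0},
    {"a": 1, "b": 0, "c": 0},
    {"a": 0, "b": 1, "c": 1},
    {"a": 1, "b": 1, "c": 0},
]
matrix = discernibility_matrix(objects, attributes)
print(sorted(matrix[0][1]), sorted(matrix[1][2]))
```

The matrix is symmetric with empty sets on the diagonal, so in practice only the entries with i < j need to be stored.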



Input: A set of objects U (|U| = N) described by values of attributes from the set Q.

Output: The set K of all (absolute) reducts for the set U.

PHASE I

Step 1
Create the absorbed discernibility list ADL by eliminating empty and non-minimal elements:
$ADL := \{C_{ij} : C_{ij} \neq \emptyset \text{ and for no } C_{kl} \in ADL: C_{kl} \subset C_{ij}\}$, where
$C_{ij} := \{q \in Q : \delta(u_i, q) \neq \delta(u_j, q)\}$, for each pair i, j.
The resulting absorbed discernibility list contains elements $(C_1, C_2, \ldots, C_d)$, where $d \in [1, N(N-1)/2]$.

Step 2
Sort the ADL in the ascending order of the cardinality of its elements.

PHASE II

Step 1
$R_0 := \{\emptyset\}$.

Step 2
For $i = 1..d$ compute:
$S_i := \{R \in R_{i-1} : R \cap C_i \neq \emptyset\}$,
$T_i := \bigcup_{q \in C_i} \bigcup_{R \in R_{i-1} : R \cap C_i = \emptyset} \{R \cup \{q\}\}$,
$MIN_i := \{R \in T_i : Min(R, ADL_{1..i}) = true\}$,
$R_i := S_i \cup MIN_i$.

The final result is $K := R_d$.

Fig. 1. The Modified Reduct Generation Algorithm (MRGA)



The main idea of the method is to find all minimal attribute subsets that have non-empty intersections with all (non-empty) elements of the discernibility matrix. The algorithm consists of two phases: the first phase creates the elements of the discernibility matrix, and the second one generates reducts using these elements. To improve efficiency, the elements of the discernibility matrix are stored in the form of an absorbed list. In the last step of the first phase, which is a modification of the original algorithm, the absorbed list is sorted in the ascending order of the cardinality of its elements. The modified reduct generating algorithm (MRGA) is presented in Figure 1.

In the algorithm, Min performs an operation that corresponds to checking for prime implicants of a Boolean function. The returned value is true if the argument R does not contain redundant attributes. $ADL_{1..i}$ is the list of the $i$ initial elements of ADL: $ADL_{1..i} = (C_1, C_2, \ldots, C_i)$. Thanks to the links established in [7], the algorithm may also be applied directly to the problem of searching for all prime implicants of Boolean functions.
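
The two phases of MRGA can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the function names are assumptions, the discernibility elements are passed in directly as sets of attributes, and `is_minimal` is one straightforward reading of the Min operation (an attribute is redundant if removing it still leaves a non-empty intersection with every element seen so far).

```python
def absorb(elements):
    """PHASE I: drop empty and non-minimal elements, then sort by cardinality."""
    distinct = {frozenset(c) for c in elements if c}
    minimal = [c for c in distinct if not any(other < c for other in distinct)]
    return sorted(minimal, key=lambda c: (len(c), sorted(c)))


def is_minimal(r, prefix):
    """True if every attribute of r is needed to intersect all elements of prefix."""
    return all(any(not ((r - {q}) & c) for c in prefix) for q in r)


def mrga(elements):
    """PHASE II: build all minimal attribute subsets intersecting every element."""
    adl = absorb(elements)
    reducts = {frozenset()}
    for i, c in enumerate(adl, start=1):
        s_i = {r for r in reducts if r & c}
        t_i = {r | {q} for r in reducts if not (r & c) for q in c}
        min_i = {r for r in t_i if is_minimal(r, adl[:i])}
        reducts = s_i | min_i
    return reducts


# Discernibility elements of a toy table with attributes a, b, c.
elements = [{"a"}, {"b", "c"}, {"a", "b"}, {"a", "b", "c"}, {"b"}, {"a", "c"}]
print(sorted(sorted(r) for r in mrga(elements)))
```

In the toy example absorption leaves only the elements {a} and {b}, so a single subset containing both attributes is produced.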

In the form given above the algorithm searches for so-called absolute reducts. This is determined by the definition of the discernibility matrix, which is computed with a uniform treatment of all attributes from Q. If the split of Q into condition (C) and decision (D) attributes is to be taken into account and so-called relative reducts are to be produced, the discernibility matrix should be computed as follows:

$$C_{ij} := \{q \in Q : \delta(u_i, q) \neq \delta(u_j, q) \text{ and } (u_i, u_j) \notin IND(D)\}, \text{ for each pair } i, j,$$

where $(u_i, u_j) \notin IND(D)$ means that the objects $u_i$ and $u_j$ belong to different classes defined by the decision attributes.

3 The Parallel Implementation of the Algorithm 

The most general idea of parallelization is decomposing the computing task into 
several subtasks which may be then processed at the same time on different pro- 
cessors. This requires, first of all, a specialized algorithm. The main objective of 
parallelization is reducing the computing time. Another important benefit may 
be the reduction of memory requirements (see [2] for a detailed description of 
different aspects of parallel algorithms). 

Parallelization of MRGA is based on the following characteristics of the algorithm.

Observation In each iteration of Step 2 (PHASE II) of the algorithm the following holds: $S_i \cap T_i = \emptyset$; every $R \in S_i$ is a copy of an $R' \in R_{i-1}$; every $R \in T_i$ is created from a single $R' \in R_{i-1}$ by augmenting it by an element $q \in Q$; $MIN_i$ is a subset of $T_i$ created by discarding some of its elements, a process conducted on the basis of the index $i$ and the value of $ADL_{1..i}$.

Conclusion 1 Any two $R, R' \in R_i$ are independent in the sense that no information on $R$ or on how it has been created is required for producing $R'$.
Conclusion 2 The process of subset creation has a tree-like structure: in each iteration an element $R \in R_i$ may be either copied to the next iteration without changes or be deleted, giving rise to a collection of its proper supersets, which are subsequently processed independently of one another and of the other elements of $R_i$.

Input: A set of objects U (|U| = N) described by values of attributes from the set Q; the subtask count SC, the branching factor BF (it is assumed that BF ≥ SC).

Output: The set K of all (absolute) reducts for the set U.

PHASE I
As in the Modified Reduct Generating Algorithm.

PHASE II

Step 1
$R_0 := \{\emptyset\}$.

Step 2
While ($|R_i| < BF$) and ($i < d$) do loop:
$S_i := \{R \in R_{i-1} : R \cap C_i \neq \emptyset\}$,
$T_i := \bigcup_{q \in C_i} \bigcup_{R \in R_{i-1} : R \cap C_i = \emptyset} \{R \cup \{q\}\}$,
$MIN_i := \{R \in T_i : Min(R, ADL_{1..i}) = true\}$,
$R_i := S_i \cup MIN_i$.

Step 3
If ($i \geq d$) then set $K := R_i$ and stop the algorithm.
Otherwise proceed with Step 4.

Step 4
Partition the set $R_i$ into SC subsets $R_i^j$ ($j = 1..SC$), such that $R_i^k \cap R_i^l = \emptyset$ for each $k \neq l$, $\bigcup_{j=1}^{SC} R_i^j = R_i$ and the subsets $R_i^j$ are of approximately equal size.

Step 5
Execute SC parallel subtasks, each performing the procedure $K^j := SUB(i, d, R_i^j, ADL)$, for $j = 1..SC$.
Set $K := \bigcup_{j=1}^{SC} K^j$ and stop the algorithm.

Procedure SUB(InitialIndex, FinalIndex, PartialSolution, AbsorbedDiscernibilityList)

Step 1
$R_{InitialIndex-1} := PartialSolution$.

Step 2
For $i = InitialIndex..FinalIndex$ compute:
$S_i := \{R \in R_{i-1} : R \cap C_i \neq \emptyset\}$,
$T_i := \bigcup_{q \in C_i} \bigcup_{R \in R_{i-1} : R \cap C_i = \emptyset} \{R \cup \{q\}\}$,
$MIN_i := \{R \in T_i : Min(R, ADL_{1..i}) = true\}$,
$R_i := S_i \cup MIN_i$.

Return $R_{FinalIndex}$ as the result of the procedure.

Fig. 2. The Parallel Implementation of MRGA

The conclusions lead to a formulation of the parallel algorithm for reduct computation. After completing PHASE I of the sequential algorithm and performing several iterations of Step 2 (PHASE II), the resulting $R_i$ may be split into a family of subsets which form a partition of the set $R_i$, i.e. they are pairwise disjoint and their union is the set $R_i$. From now on the computations may continue in the form of several subtasks, each of which proceeds with exactly one part of the partition substituted for the whole set $R_i$. It is important that the resulting subtasks are completely independent of one another. The final set of reducts is the union of the sets of reducts generated by the subtasks.

The implementation of the parallel algorithm requires two additional parameters: the number of subtasks which are to be initiated (subtask count, SC) and the cardinality of the set $R_i$ which is to be reached before the algorithm branches into multiple subtasks (branching factor, BF). The branching factor must be equal to or greater than the number of subtasks so that each subtask may start with at least one element of $R_i$. The role of this parameter is discussed in detail in Section 4.

There are no formal restrictions as to the method of partitioning the set $R_i$ into disjoint subsets. In the described experiments this was implemented simply by assigning the attribute subset number $n$ to the subtask number $m = (n \bmod SC) + 1$.
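
The branching scheme can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the absorbed, sorted list is passed in directly, the function names are hypothetical, and a thread pool stands in for genuinely parallel processes (in CPython threads only demonstrate the decomposition and do not shorten the computing time). Subtasks start at the index following the branching point.

```python
from concurrent.futures import ThreadPoolExecutor


def is_minimal(r, prefix):
    """True if every attribute of r is needed to intersect all elements of prefix."""
    return all(any(not ((r - {q}) & c) for c in prefix) for q in r)


def step(current, i, adl):
    """One iteration of Step 2 of PHASE II for the i-th element of ADL (1-based)."""
    c = adl[i - 1]
    kept = {r for r in current if r & c}
    grown = {r | {q} for r in current if not (r & c) for q in c}
    return kept | {r for r in grown if is_minimal(r, adl[:i])}


def sub(initial_index, final_index, partial_solution, adl):
    """Procedure SUB: continue the iterations for one part of the partition."""
    current = set(partial_solution)
    for i in range(initial_index, final_index + 1):
        current = step(current, i, adl)
    return current


def parallel_mrga(adl, sc, bf):
    """Iterate until at least bf subsets exist, then branch into sc subtasks."""
    d = len(adl)
    current, i = {frozenset()}, 0
    while len(current) < bf and i < d:
        i += 1
        current = step(current, i, adl)
    if i >= d:
        return current
    ordered = sorted(current, key=sorted)
    # Subset number n goes to subtask (n mod SC) + 1.
    parts = [ordered[j::sc] for j in range(sc)]
    with ThreadPoolExecutor(max_workers=sc) as pool:
        results = list(pool.map(lambda part: sub(i + 1, d, part, adl), parts))
    return set().union(*results)


adl = [frozenset({0, 1}), frozenset({1, 2}), frozenset({0, 2}), frozenset({3, 4})]
print(sorted(sorted(r) for r in parallel_mrga(adl, sc=2, bf=2)))
```

Since each subtask needs only its own part of the partition and the list ADL, the union of the subtask results coincides with the output of the sequential algorithm.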

4 Experimental Evaluation of the Parallel Implementation 

The data set used in the experiment was a medical file containing descriptions of 
343 patients after the ESWL treatment [8]. Each patient is described by a vec- 
tor of 33 condition attributes and one decision attribute (defining two decision 
classes). The file was selected for presenting the results of the experiments only 
because of its computational characteristics: a large number of relative reducts 
(38207), a moderate computing time (327 seconds) and small memory require- 
ments (about 380 kB). It is stressed that this paper contains no claims as to 
any potential suitability of the selected data file for reduct generation or the 
usefulness of the generated reducts in further analyses. 

The practical behaviour of the parallel implementation may be best characterized by plotting the computing times (understood as the maximal computing time of all of the subtasks plus the time required to reach the branching point) against the increasing number of subtasks employed (see Figure 3). The same applies to the maximal memory requirements (Figure 4).

The main computing platform in the experiments was a SUN workstation equipped with a SPARC processor running at 110 MHz.

Computing Times 

Inspection of Figure 3 allows us to conclude that the parallel implementation of the algorithm reduces the computing time considerably. The fact that the decrease of the computing time is worse than linear may be explained as follows. The decomposition of the original problem into subtasks is unbalanced in the sense that the number of reducts generated by each of the subtasks is highly variable. This is because it is impossible to determine in advance how many reducts will finally be generated by a given subtask, which in turn results from the highly variable 'productivity' of the different attribute subsets initially assigned to the subtask. Experiments indicate that running the algorithm with the value of the branching factor equal to the number of subtasks (which is the lowest possible value of this parameter) gives highly unbalanced subtasks, including situations where the maximal subtask computing times are close to the computing time of the sequential implementation. A solution to this seems to be increasing the value of the branching factor. This results, first of all, in a direct increase of the initial computing time (it takes longer to reach the branching point of the algorithm), but it also has a positive effect on balancing the subtasks and, in the longer run, proves to be beneficial.

The general conclusion is that the computing time decreases satisfactorily only after the branching factor is set to a value which enables each of the subtasks to start with at least several attribute subsets. This is, however, only an improvised, temporary principle: discovering an optimal (provided there exists one) or close to optimal value of the branching factor needs further research.



[Bar chart: Maximal Computing Times against Number of subtasks (1, 2, 4, 6, 8, 10), with series BF=50, BF=100 and BF=200]

Fig. 3. Maximal computing times against the increasing number of subtasks and different branching factors



Memory Requirements 

Another very important feature of the parallel implementation is the characteristics of its memory requirements. Inspection of Figure 4 reveals that the maximal memory requirements of the sequential algorithm are considerably reduced by its decomposition into subtasks, the reduction being most evident for a small number of subtasks and becoming less evident with more subtasks. The fact that the decrease of memory requirements is worse than linear may be explained as follows. There are two main memory-consuming structures of the algorithm: the set $R_i$ and the initial (unabsorbed) form of ADL. Usually the size of $R_i$ considerably exceeds that of the unabsorbed list, but after partitioning of $R_i$ the size of the unabsorbed list starts to dominate, obliterating the effects of partitioning.

Applying parallelization with small numbers of subtasks, however, still seems to be a very good alternative, especially in computing environments where memory proves to be the bottleneck of the process.



Fig. 4. Maximal memory requirements against the increasing number of subtasks and different branching factors



5 Conclusions and Directions of Future Research 

This paper introduces a parallel algorithm for computing all exact reducts of a 
decision table. The algorithm is based on an algorithm described originally in 
[7]. Results of experiments aimed at verifying the algorithm’s practical behaviour 
are also presented and discussed. The results indicate usefulness of the parallel 
implementation, which allows for the reduction of both the computing time and 
memory requirements of the reduct generating process. 

There still remain some issues which need further research. The main ones are: establishing the number of subtasks to be started, and balancing the subtasks so that they generate approximately equal numbers of reducts and exhibit similar computing times and memory requirements.

In its current version, the parallel algorithm branches at one point into an arbitrary number of subtasks. Such a solution does not take into account the computational needs of the particular data set for which the algorithm has been invoked, because it is not possible, in general, to determine the final number of reducts in advance. In an alternative approach, the number of subtasks would not have to be specified, the only control parameter being the branching factor. Upon reaching it, the algorithm would branch into two parallel subtasks, each of them proceeding with half of the current load. The same branching procedure would subsequently be applied recursively in both resulting subtasks.

A related problem is that of balancing the resulting subtasks. Presently, the parallel algorithm assigns the initially created attribute subsets one by one to consecutive subtasks. Because the original ordering of the attribute subsets is random, this partitioning may also be viewed as random. It must be stressed, however, that the ordering of the absorbed discernibility list, from which the attribute subsets are created, is not random: the list is sorted in ascending order of the cardinality of its elements. As a result, the initial attribute subsets contain those attributes that occurred in low-cardinality elements of the discernibility list. A better balancing scheme could incorporate some attribute counts, so that the numbers of different attributes in the initial subsets would be approximately the same.

Abstract. In this paper we present an approach to developing decision support systems. We focus on knowledge acquisition and processing with the use of rough set theory. The rule-based knowledge representation is also considered. We discuss the use of Prolog as a tool for knowledge representation and processing. Finally, a way of embedding Prolog code in programs written in procedural languages is presented. Our work is illustrated with an example system supporting credit decisions.

1 Introduction 

Systems for supporting decisions are one of the domains of artificial intelligence widely applied in practical solutions. By systems for supporting decisions we mean programs that perform inquiries into specific problems. Typically these systems keep an explicit knowledge base and control mechanisms separated from each other. Another advantage is the possibility of giving explanations for the solutions found by the inference engine.

The technology of expert systems has many applications in the field of economy, e.g. in banking [S195,Sk96], management [Sr94], finances [Kr95], insurance, investments and marketing [Ba94a,Po97].

The main components of an expert system are as follows: a knowledge base, 
an inference engine and a user interface (see Fig. 1). The synthesis of an ex- 
pert system should be preceded by an analysis of three problems: knowledge 
acquisition, knowledge representation and knowledge utilization. 

One of the knowledge acquisition methods is obtaining knowledge from previously existing examples of decisions made by experts. Such a learning tactic is often described as an inductive method [Mi83]. From the various techniques of knowledge processing compared in [Po97], we have chosen and applied rough set theory, introduced by Pawlak [Pa82,Pa91], with the extensions described in [Pl,S192,Ba94b]. We used the DataLogic system [D192] to transform the knowledge obtained from the database into the form of a rule knowledge base. The resulting knowledge base consists of decision rules which, in turn, can be directly presented in the form of Prolog rules [Me89]. Prolog, as a logic language based on predicate calculus [Ch73,C184], is a well-suited tool for knowledge representation and for developing an inference engine. In this article we present a hybrid approach to developing expert systems. The inference engine and the knowledge base were implemented in the logic language, whereas the user interface and database modules were developed using classical procedural programming. This connection was possible thanks to the Amzi! Logic Server [Am95].

Fig. 1. Main components of the expert system

This article presents a practical implementation of an expert system for supporting credit decisions and an application of the above-mentioned mechanisms.

2 Knowledge Acquisition - Rough Approach 

It is accepted that a decision problem can be described by a finite parameter set $C = \{c_1, c_2, \ldots, c_k\}$, further called condition attributes, a finite parameter set $D = \{d_1, d_2, \ldots, d_p\}$, further called decision attributes, and the respective domains (ranges of values):

$$V_C = \bigcup_{c_i \in C} V_{c_i} \quad \text{for condition attributes} \qquad (1)$$

and

$$V_D = \bigcup_{d_j \in D} V_{d_j} \quad \text{for decision attributes.} \qquad (2)$$

With such a description of a decision problem, its solution consists in determining the values of the condition attributes and, on this basis, setting the values of particular decision attributes. Such a process of solving decision problems can be represented as a decision protocol. Fig. 2 shows the structure of this decision protocol.

Fig. 2. Rough set decision table



It is easy to notice that the decision protocol in Fig. 2 contains decision rules. They can be written as conjunctions of elementary conditions.

For the j-th decision we have:

$$\text{if } \{(c_1 = v_{c_1}^j) \,\&\, (c_2 = v_{c_2}^j) \,\&\, \ldots \,\&\, (c_k = v_{c_k}^j)\} \text{ then } \{(d_1 = v_{d_1}^j) \,\&\, \ldots \,\&\, (d_p = v_{d_p}^j)\} \qquad (3)$$

Given a suitable decision protocol (containing e.g. expert decisions solving a definite problem), the problem of knowledge acquisition can be reduced to the problem of generating all distinct decision rules of the form (3) based on this protocol.

For this purpose we can use efficient software tools based on rough set theory, e.g. LERS [Gr92] or DataLogic [D192]. The main function of these tools is the transformation of databases (expressed in the form of decision protocols or semantically equivalent decision tables) into the form of a reduced decision rule base (a knowledge base).

3 Rule Knowledge Base Representation 

Knowledge representation is a formal way of expressing knowledge so that it can be efficiently stored and processed in computer memory.

In Prolog the knowledge base consists of a set of facts and rules. Prolog makes it possible to record in the knowledge base not only the facts but also information about relations between the facts, in the form of rules that are called Horn clauses [C184].

Prolog clauses also allow conjunctions and disjunctions of conditions to be used at the same time [Ch73]. When analyzing a decision rule of the form (3) it is not difficult to notice that it can be represented naturally in the form of an adequate Prolog Horn clause.
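
The correspondence between a decision rule of the form (3) and a Prolog clause can be sketched with a small Python helper that writes the clause as text. The predicate names `decision` and `attribute`, the function name and the credit example are hypothetical and do not come from the described system; attribute names and values are assumed to be valid lower-case Prolog atoms.

```python
def rule_to_prolog(conditions, decision):
    """Format one decision rule as the text of a Prolog clause.

    conditions: list of (attribute, value) pairs joined by conjunction.
    decision:   a single (attribute, value) pair forming the clause head.
    """
    head = f"decision({decision[0]}, {decision[1]})"
    body = ", ".join(f"attribute({name}, {value})" for name, value in conditions)
    return f"{head} :- {body}." if body else f"{head}."


rule = rule_to_prolog([("income", "high"), ("history", "good")], ("credit", "grant"))
print(rule)
```

Each row of a reduced decision table yields one such clause, so the rule base produced by a rough set tool can be loaded into a Prolog knowledge base as plain text.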

4 Prolog Inference Engine 

Prolog, as a logic language, is well suited to the implementation of inference systems. Prolog inference engines are based on search strategies. A built-in unification mechanism and inference based on the resolution principle [Ch73] are also helpful during the inference process.

The most important features of expert systems are easy to implement in Prolog [Me89]. These are: backward chaining, forward chaining, rule representation of data and explanations.

Although expert systems written in conventional languages, such as C, often perform better, Prolog code keeps an expert system close to a logical specification of the program and thus easy to modify.



5 User Interface 

For a user, the most important part of a computer system supporting decisions
is the user interface. Interactive programming environments have become
an indispensable tool for solving complicated problems related to systems
development.

For this purpose, either intelligent communication algorithms or modern
interactive methods can be used. Intelligent communication algorithms make it
possible to translate computer requirements into the human way of reasoning
in the process of making decisions, whereas modern interactive methods allow a
user to communicate efficiently with a computer. In this way a human is able
to watch particular stages of the decision process performed by the computer,
influence its course and obtain, besides the final solution of the problem,
intermediate information of various types.

6 Logic Server 

A logic server enables Prolog predicates to be integrated with conventional
programming languages, databases and other programming tools. In our research
we use the Logic Server produced by Amzi!. It provides memory management
separated from the base application. It includes a full graphic interface that
makes it possible to access a Prolog module from outside. Interface libraries
are provided for many popular languages, such as C++, Delphi and Visual Basic.

The structure of the complete decision supporting system is presented in 
Fig. 3. 



7 An Example of the Decision Support System 

The economic decision problem presented below is intended to show how elements
of rough set theory can be used as a formal tool in the synthesis of computer
systems that support the solution of such problems.

Granting credits to individuals or businesses is one of the fundamental
duties and functions of modern banks. Such activity involves a certain level of
risk. The risk results from the difficulty of determining explicitly the
so-called credit capacity of a debtor, i.e. the ability to repay the credit
together with the interest due [Si93]. At the stage of negotiating credit
terms, the conflicting interests of banks and debtors come into play.

This conflict of bank and debtor interests, together with the incompleteness,
inexplicitness and uncertainty of the available information and the difficulty
of selecting parameters and criteria that allow an objective evaluation of
credit capacity, makes the credit decision problem difficult to formalize.

Fig. 3. The complete decision supporting system

This fact motivated the attempt to synthesize a computer system supporting
economic decisions, based on elements of rough set theory at the stage of
knowledge acquisition and representation. A detailed description of this
synthesis is given in [Sk96].



7.1 Knowledge Sources and Acquisition 

From the formal point of view, the bank crediting process consists of two
partial problems:

— preparing the premises for decision making, i.e. an honest, complete and
credible valuation of the debtor’s credit capacity;

— opening the credit up to a certain limit and on conditions that minimize
the risk.

The solution of this problem must address both partial problems. Knowledge
acquisition was made possible by the accessible publications, discussions
with experts and observations of credit decisions made by banks in real
conditions. It is described in detail in [Sk96].

The following economic indexes useful in the valuation of credit capacity have
been chosen [Si93]:

— Net Profitability Ratio: $Ind_1 = \frac{\text{Net Profit}}{\text{Sale Value}} \cdot 100\%$

— Current Ratio: $Ind_2 = \frac{\text{Current Assets}}{\text{Current Liabilities} + \text{Short-term Credit}}$

— Quick Ratio: $Ind_3 = \frac{\text{Current Assets} - \text{Stock}}{\text{Current Liabilities}}$

— Accounts Receivable Turnover Ratio: $Ind_4 = \frac{\text{Average Accounts Receivable}}{\text{Net Sale}}$

— Inventory Turnover Ratio: $Ind_5 = \frac{\text{Average Stock}}{\text{Net Sale}}$

— Exceeded Payables Ratio: $Ind_6 = \frac{\text{Exceeded Accounts Payables}}{\text{Total Payables}}$

— Equity Ratio: $Ind_7 = \frac{\text{Outside Capital}}{\text{Ownership Capital}}$

— Ownership Capital Ratio: $Ind_8 = \frac{\text{Ownership Capital}}{\text{Total Assets}}$

— Interest Coverage Ratio: $Ind_9 = \frac{\text{Interest}}{\text{Sale}} \cdot 100\%$



Condition and decision attributes as well as their domains were determined
with the help of experts’ suggestions and the economic indexes described above.
The data were recorded in a relevant economic decision protocol, which forms
the basis of the knowledge acquisition process. The general scheme of a
decision protocol was presented in Fig. 2.
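
The nine economic indexes listed above can be computed directly from a company's financial figures. The following sketch assumes the figures are available in a dictionary; the key names and the sample numbers are hypothetical and serve only to illustrate the arithmetic.

```python
def credit_indexes(f):
    """Compute Ind1..Ind9 from a dict of financial figures (hypothetical keys)."""
    return {
        "Ind1": f["net_profit"] / f["sale_value"] * 100.0,          # in percent
        "Ind2": f["current_assets"]
                / (f["current_liabilities"] + f["short_term_credit"]),
        "Ind3": (f["current_assets"] - f["stock"]) / f["current_liabilities"],
        "Ind4": f["avg_accounts_receivable"] / f["net_sale"],
        "Ind5": f["avg_stock"] / f["net_sale"],
        "Ind6": f["exceeded_accounts_payable"] / f["total_payables"],
        "Ind7": f["outside_capital"] / f["ownership_capital"],
        "Ind8": f["ownership_capital"] / f["total_assets"],
        "Ind9": f["interest"] / f["sale_value"] * 100.0,            # in percent
    }


# Made-up figures for a fictitious company.
figures = dict(
    net_profit=50.0, sale_value=1000.0, current_assets=400.0,
    current_liabilities=150.0, short_term_credit=50.0, stock=100.0,
    avg_accounts_receivable=120.0, net_sale=960.0, avg_stock=80.0,
    exceeded_accounts_payable=10.0, total_payables=200.0,
    outside_capital=300.0, ownership_capital=600.0, total_assets=900.0,
    interest=20.0,
)
indexes = credit_indexes(figures)
```

Mapping each index to "acceptable" or "unacceptable" requires thresholds supplied by experts, which are not given here and are therefore left out of the sketch.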



Condition Attributes. Here are the selected condition attributes:

[$c_1$] - Net Profitability Ratio - noted as $Ind_1$, domain: {acceptable, unacceptable}

[$c_2$] - Net Profitability Ratio tendency - domain: {increase, decrease}

[$c_3$] - Net Profitability Ratio in comparison with the other companies of a branch -
domain: {high, low}

[$c_4$] - Current Ratio - noted as $Ind_2$, domain: {acceptable, unacceptable}

[$c_5$] - Quick Ratio - noted as $Ind_3$, domain: {acceptable, unacceptable}

[$c_6$] - Accounts Receivable Turnover Ratio - noted as $Ind_4$, domain: {acceptable, unacceptable}

[$c_7$] - Inventory Turnover Ratio - noted as $Ind_5$, domain: {acceptable, unacceptable}

[$c_8$] - Exceeded Payables Ratio - noted as $Ind_6$, domain: {acceptable, unacceptable}

[$c_9$] - Equity Ratio - noted as $Ind_7$, domain: {acceptable, unacceptable}

[$c_{10}$] - Ownership Capital Ratio - noted as $Ind_8$, domain: {acceptable, unacceptable}

[$c_{11}$] - Ownership Capital Ratio tendency - domain: {increase, decrease}

[$c_{12}$] - Interest Coverage Ratio - noted as $Ind_9$, domain: {acceptable, unacceptable}

In this way we have obtained the set C of condition attributes

$C = \{c_1, c_2, \ldots, c_{12}\}$.

Decision Attributes. Risk rating applied in practice by crediting banks allows 
us to classify credits into the following groups [Sk96]: 

[Group 1] - ordinary credit. 

[Group 2] - observed credit. 

[Group 3] - doubtful credit. 

According to the above classification, it was assumed that the credit risk
group is the only decision attribute.

Finally we obtained the single-element set D of decision attributes
($D = \{d_1\}$). The above-mentioned risk groups form the domain of the set D.

Establishing the condition and decision attribute sets explicitly determined
the structure of the relevant decision protocol. This protocol was helpful in
the data acquisition process.

Credit decisions recorded as protocol cases came from practical bank
consultations (which took place during student internships) and from available
specialist publications [Si93]. The complete data set contains 512 rules and is
described in [Sk96].

7.2 Knowledge Reduction 

The data set has been reduced by means of DataLogic [D192]. 

The main function of this program is the reduction of data sets to the
form of decision rules. The process of generating decision rules consists of the
following stages: searching for reducts, excluding redundant attributes and
reducing redundant records.

The complete decision table consisted of 512 items. After the reduction pro- 
cess the knowledge base included 150 decision rules. 

Three condition attributes ($c_2$, $c_3$ and $c_{11}$) were found unnecessary
and could be removed. The reason is that these attributes describe the tendency
of an index and its comparison with the other companies of the branch, while
the values of the index itself already exist in the set of attributes. However,
to allow for extension of the knowledge base, they were kept as parameters in
the program. As the application makes it possible to record new cases in the
knowledge base, these parameters may become useful while the system is in use.
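
The idea of an unnecessary (dispensable) attribute can be illustrated with a small sketch: an attribute may be dropped if the decision table stays consistent without it, i.e. equal condition values still imply equal decisions. This is a generic rough-set check written for illustration, not the procedure implemented in DataLogic, and the toy table below is hypothetical.

```python
def is_consistent(rows, attrs):
    """True if equal values on attrs always imply the same decision 'd'."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, row["d"]) != row["d"]:
            return False
    return True


def dispensable(rows, attrs):
    """Attributes that can be dropped one at a time without losing consistency."""
    return [a for a in attrs
            if is_consistent(rows, [b for b in attrs if b != a])]


# Toy decision table (hypothetical values): c2 repeats what c1 already says.
rows = [
    {"c1": 1, "c2": 1, "c4": 1, "d": "ordinary"},
    {"c1": 1, "c2": 1, "c4": 0, "d": "observed"},
    {"c1": 0, "c2": 0, "c4": 1, "d": "observed"},
    {"c1": 0, "c2": 0, "c4": 0, "d": "doubtful"},
]
removable = dispensable(rows, ["c1", "c2", "c4"])
```

In this toy table either `c1` or `c2` can be dropped, but not both at once, which is why reduct searching has to examine attribute subsets rather than single attributes in isolation.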

The particular decisions included the following numbers of rules: 

— DOUBTFUL — 55 rules, 

— OBSERVED — 81 rules, 

— ORDINARY — 14 rules.

The complete set of decision rules is published in [Sk96].

In rough set analysis the attribute strength report is also very important. It
appears that the attributes remaining in the knowledge base have approximately
equal strength (the largest difference reaches 17%). This means that the
particular attributes have a similar influence on the final decision.



7.3 Implementation of the Inference Engine 

The credit decisions support system includes the inference engine which performs 
the following tasks: 

— reads contents of the credit database; 

— accepts and removes facts about economic indicators of an analyzed com- 
pany; 

— performs the inference; 

— gives the explanation “HOW”;

— displays all facts currently available in the system; 

— displays all rules from the knowledge base. 
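
The tasks listed above can be sketched as a very small rule interpreter. The Python class below is an illustrative assumption only: the actual engine was written in Prolog, and the class name, method names and the two sample rules are hypothetical.

```python
class InferenceEngine:
    """Toy engine: stores facts, fires the first matching rule, explains it."""

    def __init__(self, rules):
        self.rules = rules   # list of (name, conditions dict, decision)
        self.facts = {}      # current facts: attribute -> value
        self.trace = None    # rule used by the last successful inference

    def accept(self, attr, value):
        self.facts[attr] = value

    def remove(self, attr):
        self.facts.pop(attr, None)

    def infer(self):
        for name, conditions, decision in self.rules:
            if all(self.facts.get(a) == v for a, v in conditions.items()):
                self.trace = (name, conditions, decision)
                return decision
        self.trace = None
        return None

    def how(self):
        """Explain which rule and which facts produced the last decision."""
        if self.trace is None:
            return "no rule fired"
        name, conditions, decision = self.trace
        used = " and ".join(f"{a}={v}" for a, v in conditions.items())
        return f"{decision} because {name}: {used}"


engine = InferenceEngine([
    ("r1", {"c4": "unacceptable"}, "doubtful"),
    ("r2", {"c4": "acceptable", "c5": "acceptable"}, "ordinary"),
])
engine.accept("c4", "acceptable")
engine.accept("c5", "acceptable")
result = engine.infer()
explanation = engine.how()
```

Listing the stored facts and rules corresponds to reading `engine.facts` and `engine.rules`.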



7.4 Modular Structure of the System 

The complete system for supporting credit decisions consists of the following 
modules: 

— module for data processing - used for editing credit applications and
filling in information about companies and their financial reports;

— module for monitoring of economic indicators - used for current analysis of
economic indicators; it provides an instant overview of indicator values;

— inference engine with a rule knowledge base - performs credit risk
estimation on the basis of the economic ratios of the company being examined;
the module includes a built-in explanation mechanism and also allows the user
to modify the knowledge base; it was developed in Prolog.

Abstract. As shown by the results of some recently proposed methods,
the perceptual coding of audio can be used for suppressing the noise
affecting transmitted audio signals. Tuning the perceptual audio coding
algorithm requires finding the relations between the parameters of the
masking algorithm and their influence on the subjective quality of the
processed audio. The rough set method was employed to discover ill-defined
relations underlying the implemented perceptual model of hearing.



1 Introduction 

Presently, the majority of audio signals transmitted or recorded in digital
systems are perceptually encoded. However, as shown by the results of recently
proposed methods, the perceptual coding of audio can also be used for
suppressing the noise affecting audio signals [1]. Such noise can be caused by
the recording procedures or by the transmission of audio signals through
telecommunication channels. Thus, the application of perceptual coding may not
only preserve the original quality of audio, but can also subjectively improve
it. The masking curves, which provide the basic mechanism of this algorithm,
divide the spectral magnitudes of the audio signal into two categories: audible
components (extending above the masking curves) and inaudible ones (remaining
below these curves). When determining the settings of the masking model, we
have to rely on experimental procedures. The parameter values to be optimized
can be discerned only by listening tests, yet their results do not clearly
reveal the sought dependencies between parameters and audio quality. Hence,
the rough set method was employed by the authors to facilitate the process of
optimizing the perceptual algorithm for noise reduction. A special test
procedure was elaborated for this purpose, which uses both the principles of
psychometric scaling and a rule base building method based on rough sets. The
perceptual coding algorithm and the proposed soft computing method for its
optimization are presented in this paper.

2 Outline of the Perceptual Method for Noise Reduction 

The method consists mainly in the continuous analysis of the signal-to-noise
relations in consecutive sample packets, calculating the optimum shape of the
masking curves and processing the noisy audio using the perceptual coding
algorithm. The noise sample is always taken from the silence passage nearest to
the currently processed fragment of the signal. On the basis of the spectral
power density of noise in neighboring “silence passages” and the spectral power
density of noise in the current signal packet, the masking threshold is raised
separately in each critical band in order to keep the noise below the masking
curves. The absolute threshold of hearing is calculated as in the literature
[2]:

$T_q = 3.64 \cdot f^{-0.8} - 6.5 \cdot \exp[-0.6 \cdot (f - 3.3)^2] + 10^{-3} \cdot f^4$   (1)

where: $T_q$ - level of hearing threshold [dB], $f$ - frequency [kHz].

The linear frequency scale is transformed into the critical band-related Bark
scale on the basis of the dependencies found by Zwicker [2]:

$b = 13 \cdot \arctan(0.76 \cdot f) + 3.5 \cdot \arctan[(f/7.5)^2]$   (2)

where: $b$ - frequency [Bark], $f$ - frequency [kHz].

The masking phenomenon caused by the excitation component on the basilar
membrane in the inner ear may be approximated by two segments inclined at
angles $S_1$, $S_2$ (see Fig. 1):

$S_1 = 27$

$S_2(E, i) = 24 + 0.23 \cdot f^{-1}(i) - 0.2 \cdot E(i)$   (3)

where: $S_1$, $S_2$ - inclination measures expressed in [dB/Bark],
$i$ - critical band number, $i = 0, 1, \ldots, 24$.
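
The absolute threshold of hearing (1) and the Bark transformation (2) are simple closed-form expressions and can be evaluated directly. The sketch below is illustrative; the function names are hypothetical, and frequencies are expected in kHz as in the formulas.

```python
import math


def threshold_in_quiet_db(f_khz):
    """Absolute threshold of hearing, eq. (1); f in kHz, result in dB."""
    return (3.64 * f_khz ** -0.8
            - 6.5 * math.exp(-0.6 * (f_khz - 3.3) ** 2)
            + 1e-3 * f_khz ** 4)


def khz_to_bark(f_khz):
    """Critical-band rate, eq. (2); f in kHz, result in Bark."""
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)


tq_1k = threshold_in_quiet_db(1.0)    # threshold at 1 kHz
bark_1k = khz_to_bark(1.0)            # 1 kHz expressed in Bark
tq_min = threshold_in_quiet_db(3.3)   # near the most sensitive region of hearing
```

Evaluating the threshold near 3.3 kHz gives a value below that at 1 kHz, reflecting the dip introduced by the exponential term.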




Fig. 1. Approximation of the masking threshold 



According to Johnston [3] the distance between the excitation level and the
masking level, $O(i)$, can be determined on the basis of the following
relationship:

$O(i) = \alpha (14.5 + i) + 5.5 \cdot (1 - \alpha)$   (4)

where: $\alpha$ ($0 < \alpha < 1$) is the so-called tonality index, which may
be computed on the basis of the N-point Fast Fourier Transform algorithm [3].

In order to hide the noise affecting the useful signal, the masking thresholds
$T(i)$ should be raised appropriately, which may be achieved numerically by
setting appropriate values for them. This can be done by adding an additional
variable $\beta(i)$ to formula (4) as follows:

$O(i) = \alpha (14.5 + i) + 5.5 \cdot (1 - \alpha) + \alpha_n \beta(i)$   (5)

where: $\alpha_n$ is the tonality index for a sample of noise taken from the
silence passage nearest to the currently processed fragment of the signal.
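
Formulas (4) and (5) differ only in the additional noise-dependent term, so both can be covered by one small function. The sketch below is illustrative: the function name is hypothetical, and the additional variable of formula (5) is passed as the argument `beta`.

```python
def masking_offset_db(i, alpha, alpha_n=0.0, beta=0.0):
    """Distance between excitation and masking level in critical band i.

    With alpha_n * beta == 0 this is eq. (4); otherwise it is eq. (5),
    where alpha_n is the tonality index of the noise sample.
    """
    return alpha * (14.5 + i) + 5.5 * (1.0 - alpha) + alpha_n * beta


tonal = masking_offset_db(10, alpha=1.0)                 # purely tonal masker
noisy = masking_offset_db(10, alpha=0.0)                 # purely noise-like masker
raised = masking_offset_db(10, alpha=0.0, alpha_n=0.5, beta=4.0)
```

The two extreme tonality values show the range of the offset in a band: 14.5 + i dB for a tonal masker and 5.5 dB for a noise-like one.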

Since masking is not a local phenomenon (it influences a certain bandwidth),
the level of masking at the frequency $b_k$ [Bark] caused by the presence of a
component at another frequency is derived, which allows the masking threshold
to be defined as follows:

$T(i) = 10^{\log_{10} E(i) - O(i)/10}$   (6)

where:

$E(i) = B_{ij} \cdot S_p(i)$   (7)

and: $E(i)$ - integrated excitation in the i-th critical band, $B_{ij}$ -
spreading function for the distance between critical bands $\Delta b = i - j$
(this function can be given in the form of a look-up table [3]), $S_p(i)$ -
spectral power density in the i-th critical band.

Another phenomenon which should be taken into consideration is post-masking
(the influence of previously occurring excitations on the current masking
effect). Post-masking can be modeled by the dependence:

$S_t(i,k) = S_a(i,k) + T_r(i) \cdot S_t(i,k-1)$   (8)

where: $S_t(i,k)$, $S_t(i,k-1)$ - spectral power densities for the k-th and
(k-1)-th sample packet, $S_a(i,k)$ - transformed spectral power density in the
i-th critical band:

$S_a(i,k) = \alpha_0(b_i) \cdot \sum_{\Omega=\Omega_{di}}^{\Omega_{gi}} S_p(e^{j\Omega}, k)$   (9)

where: $\alpha_0(b_i)$ - coefficients of transformation, $S_p(e^{j\Omega}, k)$ -
spectral power density for the k-th sample packet.

The energy transmission coefficient $T_r(i)$ used in equation (8) is given by
the relationship:



Tr{i) = ( 10 ) 

where: $d$ - time lag between consecutive sample packets [ms], $\tau(i)$ - time
constant [ms] dependent on the i-th critical band. The post-masking features
can be supported by the algorithm, provided that $S_p(i)$ in equation (7) is
replaced by $S_t(i,k)$ calculated on the basis of eq. (8).
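
The recursive character of the post-masking model (8) can be illustrated for a single critical band. The sketch below reads the coefficient in (8) as the energy transmission coefficient and assumes, purely for illustration, an exponential decay $T_r = \exp(-d/\tau)$; this form is an assumption and is not taken from the text, and the function name is hypothetical.

```python
import math


def post_masking(s_a, d_ms, tau_ms):
    """Apply the recursion of eq. (8) over consecutive packets of one band.

    s_a    - transformed spectral power densities S_a(i, k), k = 0, 1, ...
    d_ms   - time lag between consecutive sample packets [ms]
    tau_ms - time constant of the band [ms]
    """
    t_r = math.exp(-d_ms / tau_ms)  # ASSUMPTION: exponential decay coefficient
    s_t, out = 0.0, []
    for value in s_a:
        s_t = value + t_r * s_t     # current packet plus decayed history
        out.append(s_t)
    return out


# A single excitation followed by silence decays gradually instead of vanishing.
smoothed = post_masking([1.0, 0.0, 0.0], d_ms=10.0, tau_ms=10.0)
```

A larger time constant keeps more of the earlier excitation in the current packet, which is the behaviour the parameter $\tau(i)$ is meant to control.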

3 Tuning the Model

As follows from the previous section, in order to determine the masking
effect, some values have to be computed for each signal packet, among others
$S_t(i)$. In order to calculate these values within a given critical band,
three parameters should be defined for each critical band
$i = 1, 2, \ldots, 24$, namely $\beta(i)$, $\tau(i)$ and $\alpha_0(i)$.

This results in a set of $3 \cdot 24 = 72$ parameters to be tuned. A
proper selection of these parameter values allows one to control the process of
noise removal. In practice, the 24 critical bands were grouped into three
regions: low frequency (bands 1 to 8), mid frequency (bands 9 to 16) and high
frequency (bands 17 to 24). The three parameters to be optimised, namely
$\beta(i)$, $\tau(i)$ and $\alpha_0(i)$, were set identically within each group
of critical bands. Consequently, $3 \cdot 3 = 9$ parameters were subjected to
the optimisation procedure supported by soft computing data processing, as
described in the next sections.



4 Subjective Testing Procedure and Soft Processing of 
Results 

The most popular subjective quality testing method is the paired comparison
test. The goal of this method is to compare objects arranged in pairs and to
assess them on a two-level scale (better/worse attribute scale). Technically,
signal samples are presented in A-B order. The expert’s task is to choose the
better one from a pair of sound samples that differ in acoustic features. As a
result, a certain number is assigned to each compared sound sample. That number
reflects the experts’ overall preference. The statistical analysis of test
results can neither reveal hidden relations between the tested parameters nor
give rules instructing one how to tune a system based on such parameters. That
is why a soft computing rule-based system (the rough set method) was employed
for this task. There are many other effective machine learning algorithms;
however, in the studied case some special demands must be fulfilled to make the
acquired knowledge base applicable to the task of tuning the audio processing
algorithm. The demands are as follows:

- the knowledge should be presented in the form of a set of readable rules

- the rules should be associated with a belief measure that allows them to be
ranked, because it is unlikely that only certain rules can be induced on the
basis of the subjective opinions of many experts (a wide margin of uncertainty
is expected)

- the system should be able to deal with values expressed by ranges

The above considerations led the authors to choose the rough set decision
system as a tool matching the demands related to processing data acquired on
the basis of subjective opinions. Let us assume that the number of assessed
audio patterns (related to a single tuned parameter of the algorithm) is X, the
number of subjects involved in the subjective listening session equals Y and,
finally, the number of test series equals Z. Hence, the maximum number of
answers in a paired comparison test equals:

$N = N_i \cdot Y \cdot Z$   (11)
where $N_i$ is the number of pairs assessed by an individual expert in one
series of a test, calculated according to the following formula:

$N_i = \binom{X}{2} = \frac{X!}{(X-2)! \cdot 2!}$   (12)

In this way testing 10 audio patterns by 5 experts in 2 series results in 450
votes, which are distributed among the objects. The test was performed 9 times,
because there were 9 parameters, $\beta(1)$, $\beta(2)$, $\beta(3)$, $\tau(1)$,
$\tau(2)$, $\tau(3)$, $\alpha_0(1)$, $\alpha_0(2)$, $\alpha_0(3)$, related to
the 3 frequency regions. Preference diagrams were obtained on the basis of the
experts’ answers (see the example in Fig. 2). The individual objects (A to J)
were assessed when paired with all the others. Because each object was ranked
higher a certain number of times by some experts when compared to other
objects, the previously mentioned 450 votes are distributed among the objects,
as seen in Fig. 2. The preference curve reveals some maxima for the objects
processed with certain parameter settings. Thus, the plot indirectly provides
information on perceptible differences between parameter values. In this way
the so-called perceptual quantization can be done. Usually, for practical
needs, the number of preference ranges can be reduced, so the scale of the
number of votes reflecting this preference can, for example, be decimated.
Hence, in the example shown (Fig. 2) the vertical axis can be re-scaled to
reflect preferences in the range of 0 to 5.
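
The count of 450 votes follows directly from formulas (11) and (12), as the short check below shows. The function name is hypothetical.

```python
from math import comb


def max_answers(x_patterns, y_experts, z_series):
    """Maximum number of answers in a paired comparison test, eqs. (11)-(12)."""
    pairs_per_series = comb(x_patterns, 2)            # N_i: pairs seen by one expert
    return pairs_per_series * y_experts * z_series    # N = N_i * Y * Z


pairs = comb(10, 2)             # 10 audio patterns give 45 distinct pairs
votes = max_answers(10, 5, 2)   # 5 experts, 2 series
```

Because the number of pairs grows quadratically with the number of patterns, doubling the patterns to 20 would already require 190 pairs per expert and series.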




Fig. 2. Number of votes given to each tested object when assessed in pairs, reflecting the
degree of subjective preference of objects by experts (the test procedure was repeated
twice). A-J are audio patterns related to 10 values of one parameter (in the case
presented in the figure it was parameter $\alpha_0(i)$)



With the parameter values discretized into 5 ranges, it is possible to generate
$5^9 = 1953125$ audio patterns related to all combinations of the values of the
9 parameters. Obviously, such a number of tests would be completely
impractical. Thus, the number of combinations was randomly decreased to 4096 (a
data compression ratio of about 500). This operation makes the subjective
testing feasible, but it imposes a wide margin of uncertainty on the data
processing algorithm. A special computer program was prepared to generate and
play the patterns automatically, without the need to store the resulting audio
files. The grades were acquired from the experts using special keyboards
interfaced to the computer. Each audio pattern contained 5 s of mixed music and
speech. This resulted in 4096 · 5 s ≈ 6 h of music assessed by the experts
(plus 3 h for pauses between fragments). The sessions were spread over 6 days
in order to prevent the experts from getting tired. After listening to each
pattern, each expert proposed a grade of overall preference on the 5-point
preference scale. The experts were asked to assess the noise reduction effect
and the audible distortion level simultaneously. Since 5 experts were employed,
a database containing 5 · 4096 = 20480 records was created. Because the number
of test results is so large, the relations between parameters remain hidden
until they are discovered by an automatic rule induction algorithm. The large
portion of missing combinations and the inconsistency of the database left a
wide margin of uncertainty to be managed by the soft computing algorithm.
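
The random reduction of the full parameter grid can be sketched as drawing 4096 distinct indices out of $5^9$ and decoding each index into nine range numbers. This is only an illustration of the sampling idea; the program used in the experiment is not described in detail, and the names below are hypothetical.

```python
import random

LEVELS, PARAMS = 5, 9          # 5 perceptual ranges for each of 9 parameters


def decode(index):
    """Turn an integer in [0, 5**9) into a tuple of 9 range numbers (0..4)."""
    digits = []
    for _ in range(PARAMS):
        index, digit = divmod(index, LEVELS)
        digits.append(digit)
    return tuple(digits)


rng = random.Random(0)         # fixed seed only to make the sketch repeatable
total = LEVELS ** PARAMS       # all combinations of parameter ranges
subset = [decode(i) for i in rng.sample(range(total), 4096)]

listening_hours = 4096 * 5 / 3600    # 5 s per pattern
records = 5 * len(subset)            # one record per expert and pattern
```

Sampling indices rather than building the full grid avoids materializing almost two million combinations.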

4.1 Rough set-based analysis of test results 

In order to discover tendencies underlying the choice of the overall quality
assigned by the experts to particular combinations of parameter values, a rough
set-based analysis was performed. The perceptual quantization of parameter
values introduced in the previous section was used. The Quality was defined as
the decision attribute ($D$); all other attributes included in the table are
used as condition attributes ($A_j$). They represent the previously introduced
perceptual algorithm parameters. The values ($a_{ji}$) in the table are filled
in after assigning grades to real parameter values on the basis of the
previously executed perceptual quantization (see chapter 3). The table obtained
in this way is highly inconsistent, mainly because different combinations of
parameters result in the same overall subjective preference.

The number of subjects involved in the parametric test is equal to Y.
Therefore, the total number of experts’ tables equals Y, while the number of
rows in each table equals n. The number of parameters is p. Consequently,
after summing up the results provided by all experts, a database is created in
tabular form which contains Y · n · (p + 1) records; (p + 1) represents the
number of attributes including the decision attribute. After deleting
duplicated rows (superfluous data elimination) and finding reducts [4], a
reduced set of rules is obtained which contains the knowledge of the masking
algorithm parameter values preferred by the experts. The final step is
generating rules from the reducts. However, in the specific case related to the
discussed field of application, the whole set of rules should be used initially
as guidance for tuning the perceptual coding system. This is because the
reduced rules may not show how to set the values of some parameters. Since some
rules contain only some of the parameters to be set, the sum of the rules
generated from the reducts should be taken into consideration. The practical
way of tuning the system using the set of rules is described in the next
section.

5 Experimental Procedures

Ten values of each parameter within each range were selected for testing. In
this way a total of 90 parameter values were defined across the 3 frequency
ranges. Then, 10 speech and music patterns (each approximately 5 s long) were
processed using all 10 previously defined values of the perceptual coding
algorithm parameters. Consequently, after the processing, 9 sets of 10 objects
(audio patterns) were obtained. Subsequently, each 10-element set of objects
related to a given parameter was transformed into the set of
$\binom{10}{2} = 45$ pairs to be assessed subjectively. Then, the experts
provided their opinions and the number of votes given by them to each object
was computed. The scale of subjective grades was decimated, so the perceptual
quantization of parameter values was obtained. After the listening tests the
decision table was built according to the previously discussed scheme (see
Tab. 1). The rough set algorithm elaborated at the Sound Engineering Department
of the TU Gdansk was used [1]. This algorithm generated rules from this table
with various rough measure values (within the range ⟨0.5, 1.0⟩). After the
calculation of reducts a new rule set was generated (based on the reducts),
containing the 36 strongest ($\mu_{RS} > 0.8$) rules of length from 2 to 9. The
rules were ordered in such a way that the shorter and stronger ones were listed
first. The strongest rules were used during the experiments as guidance for
tuning the perceptual noise reduction system. Exemplary rules obtained in this
way were as follows:

$(\beta(2) = 3)$ and $(\beta(3) = 1) \Rightarrow (Overall\_Quality = 4)$, $\mu_{RS} = 1$
$(\beta(3) = 0)$ and $(\tau(2) = 4) \Rightarrow (Overall\_Quality = 1)$, $\mu_{RS} = 0.816$
$(\beta(1) = 4)(\beta(2) = 3)(\beta(3) = 1)(\tau(2) = 3)(\tau(3) = 2) \Rightarrow (5)$, $\mu_{RS} = 1$
$(\beta(1) = 1)(\beta(2) = 1)(\tau(3) = 1)(\alpha_0(3) = 0) \Rightarrow (2)$, $\mu_{RS} = 0.803$
$(\beta(1) = 4)(\beta(2) = 3)(\beta(3) = 1)(\tau(3) = 1)(\alpha_0(2) = 1)(\alpha_0(3) = 3) \Rightarrow (5)$, $\mu_{RS} = 0.911$
$(\beta(1) = 4)(\beta(2) = 3)(\beta(3) = 1)(\tau(1) = 2)(\alpha_0(1) = 2)(\alpha_0(2) = 1)(\alpha_0(3) = 3) \Rightarrow (5)$, $\mu_{RS} = 0.81$

No rule employing all 9 conditional attributes was found to be associated with
the decision showing the highest grade of subjective preference
($Overall\_Quality = 5$). Consequently, using the induced set of rules, the
parameters of the perceptual coding system were set in such a way that the
shortest and strongest rules were considered first. After some additional
listening, the first rule showing the best grade (= 5) was applied to set
$\beta(1)$ to 4, $\beta(2)$ to 3, $\beta(3)$ to 1, $\tau(2)$ to 3 and $\tau(3)$
to 2. The remaining parameter values ($\tau(1)$, $\alpha_0(1)$, $\alpha_0(2)$
and $\alpha_0(3)$) were set on the basis of the last two rules listed above,
namely $\tau(1)$ to 2, $\alpha_0(1)$ to 2, $\alpha_0(2)$ to 1 and $\alpha_0(3)$
to 3. The final listening test showed that these settings were acceptable to
the experts.
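
The tuning step described above, in which the shortest and strongest rules are considered first, can be sketched as follows. The three rules are the grade-5 rules listed above; the parameter names (`beta1`, `tau2`, `a0_3` and so on) and the policy that an earlier rule is never overridden by a later one are assumptions made for this illustration, since the authors also relied on additional listening.

```python
# (conditions, decision, rough measure), transcribed from the grade-5 rules above
rules = [
    ({"beta1": 4, "beta2": 3, "beta3": 1, "tau2": 3, "tau3": 2}, 5, 1.0),
    ({"beta1": 4, "beta2": 3, "beta3": 1, "tau3": 1, "a0_2": 1, "a0_3": 3},
     5, 0.911),
    ({"beta1": 4, "beta2": 3, "beta3": 1, "tau1": 2, "a0_1": 2, "a0_2": 1,
      "a0_3": 3}, 5, 0.81),
]


def tune(rules, target=5):
    """Collect parameter settings from rules leading to the target grade."""
    settings = {}
    ordered = sorted((r for r in rules if r[1] == target),
                     key=lambda r: (len(r[0]), -r[2]))   # shortest, then strongest
    for conditions, _, _ in ordered:
        for name, value in conditions.items():
            settings.setdefault(name, value)   # ASSUMPTION: earlier rules win
    return settings


settings = tune(rules)
```

Under this policy the sketch yields the same nine settings as reported in the text.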

6 Conclusions 

The concept of perceptual quantization has been introduced in this paper,
providing a usable method of replacing values by ranges in psychometric
scaling. Despite the data reduction obtained with the use of perceptual
quantization, the number of combinations of parameter settings was too large
for practical listening tests. The reduction of the number of such combinations
made the testing procedure uncertain and the table of subjective preference
highly incomplete. In such conditions the statistical processing of data cannot
help to find the optimum values of the tested parameters. The rough set
algorithm, which deals with inconsistency and missing data, made it possible to
generate the rules which were applied for tuning the perceptual noise reduction
system.

Abstract. This paper discusses the characteristics of medical reasoning
and shows the representation of these diagnostic models by the use of
rough set theory. The key ideas are both a variable precision rough set
model, which corresponds to an ordinal positive reasoning, and an upper
approximation of a target concept, which corresponds to a focusing
procedure. The acquired representation suggests that the rough set model
should be closely related to medical diagnosis.



1 Introduction 

Medical reasoning always includes uncertainty [1], which is caused by the
limitations of medical knowledge, available data and our recognition, compared
with the complexities of the human body. Thus, medical databases also have a
certain degree of uncertainty: rules extracted from databases are also
incomplete, which suggests that rule induction methods should deal with
uncertain rules.

With this motivation, rule induction based on rough set theory has been
applied to medical databases empirically [6-8, 10], the results of which show
that rough-set-based methods are very useful for extracting medical diagnostic
rules.

This paper presents how medical diagnostic rules are modeled by the concepts
of rough sets [5] in a more theoretical way. The key ideas are both a variable
precision rough set model, which corresponds to an ordinal positive reasoning,
and an upper approximation of a target concept, which corresponds to a focusing
mechanism in medical reasoning. The acquired models show that the
characteristics of medical reasoning reflect the concepts of approximation in
rough sets, which explains why rough sets work well in medical domains. The
paper is organized as follows: in Section 2, two important measures, accuracy
and coverage, are defined and a probabilistic rule is defined. Sections 3 to 5
describe three types of medical reasoning: simple differential diagnosis,
focusing mechanism and m-of-n criteria, respectively. Section 6 concludes the
paper.

2 Probabilistic Rules 

In this section, a probabilistic rule, which is a basis for describing
diagnostic rules, is defined by the use of the following three notations of
rough set theory [5]. In the following, these notations are illustrated by a
small database shown in Table 1.

First, a combination of attribute-value pairs, corresponding to a complex
in AQ terminology [4], is denoted by a formula R. For example,
[age = 50-59] & [loc = ocular] will be one formula, denoted by
R = [age = 50-59] & [loc = ocular]. Secondly, a set of samples which satisfy R
is denoted by $[x]_R$, corresponding to a star in AQ terminology. For example,
when {2, 3, 4, 5} is the set of samples which satisfy [age = 40-49],
$[x]_{[age=40-49]}$ is equal to {2, 3, 4, 5}.¹ Finally, U, which stands for
“Universe”, denotes all training samples.



Table 1. An Example of Database

No.  age    loc  nat  prod  nau  M1  class
1    50-59  occ  per  0     0    1   m.c.h.
2    40-49  who  per  0     0    1   m.c.h.
3    40-49  lat  thr  1     1    0   migra
4    40-49  who  thr  1     1    0   migra
5    40-49  who  rad  0     0    1   m.c.h.
6    50-59  who  per  0     1    1   psycho

Definitions: loc: location, nat: nature, prod: prodrome, nau: nausea,
M1: tenderness of M1, who: whole, occ: ocular, lat: lateral,
per: persistent, thr: throbbing, rad: radiating,
m.c.h.: muscle contraction headache,
migra: migraine, psycho: psychological pain,
1: Yes, 0: No.



2.1 Classification Accuracy and Coverage 



Definition of Accuracy and Coverage. According to the notations above,
classification accuracy and coverage (true positive rate) are defined as:

$$\alpha_R(D) = \frac{|[x]_R \cap D|}{|[x]_R|}, \quad \text{and} \quad \kappa_R(D) = \frac{|[x]_R \cap D|}{|D|},$$

where $|A|$ denotes the cardinality of a set $A$, $\alpha_R(D)$ denotes the classification
accuracy of $R$ with respect to the classification of $D$, and $\kappa_R(D)$ denotes the coverage, or
true positive rate, of $R$ to $D$. In the above example, when $R$ and
$D$ are set to $[nau = 1]$ and $[class = migraine]$, $\alpha_R(D) = 2/3 = 0.67$ and
$\kappa_R(D) = 2/2 = 1.0$.

It is notable that $\alpha_R(D)$ measures the degree of sufficiency of a proposition,
$R \rightarrow D$, and that $\kappa_R(D)$ measures the degree of its necessity. For example,
if $\alpha_R(D)$ is equal to 1.0, then $R \rightarrow D$ is true. On the other hand, if $\kappa_R(D)$ is
equal to 1.0, then $D \rightarrow R$ is true. Thus, if both measures are 1.0, then $R \leftrightarrow D$.
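As an illustration of the two measures, the following minimal sketch recomputes the worked values from Table 1. The encoding of the table as tuples and the helper names (`samples`, `members`, `accuracy`, `coverage`) are assumptions made for this example, not part of the original method description.

```python
# Table 1 transcribed: one tuple per sample, columns in the order of ATTRS, then the class.
ATTRS = ("age", "loc", "nat", "prod", "nau", "M1")
ROWS = [
    ("50-59", "occ", "per", 0, 0, 1, "m.c.h."),
    ("40-49", "who", "per", 0, 0, 1, "m.c.h."),
    ("40-49", "lat", "thr", 1, 1, 0, "migra"),
    ("40-49", "who", "thr", 1, 1, 0, "migra"),
    ("40-49", "who", "rad", 0, 0, 1, "m.c.h."),
    ("50-59", "who", "per", 0, 1, 1, "psycho"),
]

def samples(condition):
    """Return [x]_R as a set of 1-based sample numbers; condition maps attribute -> value."""
    return {n for n, row in enumerate(ROWS, start=1)
            if all(row[ATTRS.index(a)] == v for a, v in condition.items())}

def members(target):
    """Return D: the sample numbers belonging to the class `target`."""
    return {n for n, row in enumerate(ROWS, start=1) if row[-1] == target}

def accuracy(condition, target):
    """alpha_R(D) = |[x]_R & D| / |[x]_R| (0.0 if no sample satisfies R)."""
    x_r = samples(condition)
    return len(x_r & members(target)) / len(x_r) if x_r else 0.0

def coverage(condition, target):
    """kappa_R(D) = |[x]_R & D| / |D| (0.0 if the class is empty)."""
    d = members(target)
    return len(samples(condition) & d) / len(d) if d else 0.0

print(round(accuracy({"nau": 1}, "migra"), 2), coverage({"nau": 1}, "migra"))  # 0.67 1.0
```

Here `[nau = 1]` is satisfied by samples 3, 4 and 6, of which 3 and 4 are migraine cases, which gives the accuracy 2/3 and the coverage 2/2 quoted in the text.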



¹ In this notation, “n” denotes the nth sample in a dataset (Table 1).

2.2 Probabilistic Rules

By the use of accuracy and coverage, a probabilistic rule is defined as follows: 

Definition 1 (Probabilistic Rules). Let $R$ be a formula (conjunction of
attribute-value pairs), and let $D$ denote a set whose elements belong to a class $d$, or
positive examples in all training samples (the universe), $U$. Finally, let $|D|$
denote the cardinality of $D$. A probabilistic rule of $D$ is defined as a triple,
$\langle R \rightarrow d, \alpha_R(D), \kappa_R(D) \rangle$, where $R \rightarrow d$ satisfies $\alpha_R(D) > 0$.²

In the following sections, all the diagnostic rules are represented as special types 
of the probabilistic rule above. 



3 Simplest Diagnostic Rules 

3.1 Representation of Diagnostic Rules 

The simplest probabilistic model is that which only uses classification rules which
have high accuracy and high coverage. Such rules can be defined as:

$$R \rightarrow d \quad \text{s.t.} \quad R = \vee_i R_i = \vee_i \wedge_j [a_j = v_k], \quad \alpha_{R_i}(D) \geq \delta_\alpha \ \text{and} \ \kappa_{R_i}(D) \geq \delta_\kappa,$$

where $\delta_\alpha$ and $\delta_\kappa$ denote given thresholds for accuracy and coverage, respectively.
For the above example shown in Table 1, probabilistic rules for m.c.h. are given
as follows (both $\delta_\alpha$ and $\delta_\kappa$ are set to 0.75):

$[prod = 0] \rightarrow m.c.h.$  $\alpha = 3/4 = 0.75$, $\kappa = 1.0$,

$[nau = 0] \rightarrow m.c.h.$  $\alpha = 3/3 = 1.0$, $\kappa = 1.0$,

$[M1 = 1] \rightarrow m.c.h.$  $\alpha = 3/4 = 0.75$, $\kappa = 1.0$.



3.2 A Rule Induction Algorithm

A rule induction algorithm is shown in Figure 1; it is discussed in detail
in [9]. It is notable that induction of other types of rules is derived by simple
modification of this algorithm.
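The procedure of Figure 1 can be sketched in Python roughly as follows. This is only an illustrative reading, under several assumptions: thresholds are compared with "greater than or equal", a formula that is accepted as a rule is not specialised further, and the next level is built from pairwise conjunctions of the formulae kept in M. The names `measures` and `induce_rules` are hypothetical.

```python
from itertools import combinations

ATTRS = ("age", "loc", "nat", "prod", "nau", "M1")
ROWS = [  # Table 1, one tuple per sample; the last field is the class
    ("50-59", "occ", "per", 0, 0, 1, "m.c.h."),
    ("40-49", "who", "per", 0, 0, 1, "m.c.h."),
    ("40-49", "lat", "thr", 1, 1, 0, "migra"),
    ("40-49", "who", "thr", 1, 1, 0, "migra"),
    ("40-49", "who", "rad", 0, 0, 1, "m.c.h."),
    ("50-59", "who", "per", 0, 1, 1, "psycho"),
]

def measures(formula, target):
    """Return (accuracy, coverage) of a conjunction given as a frozenset of (attr, value)."""
    covered = [r for r in ROWS if all(r[ATTRS.index(a)] == v for a, v in formula)]
    positives = [r for r in ROWS if r[-1] == target]
    hits = [r for r in covered if r[-1] == target]
    alpha = len(hits) / len(covered) if covered else 0.0
    kappa = len(hits) / len(positives) if positives else 0.0
    return alpha, kappa

def induce_rules(target, delta_alpha, delta_kappa):
    # L_1: elementary relations, i.e. every attribute-value pair occurring in the data
    level = {frozenset([(a, r[i])]) for r in ROWS for i, a in enumerate(ATTRS)}
    rules = []
    for _ in range(len(ATTRS)):
        pending = []  # M: enough coverage, but accuracy still below the threshold
        # visit candidates in order of decreasing coverage, as in Figure 1
        for formula in sorted(level, key=lambda f: -measures(f, target)[1]):
            alpha, kappa = measures(formula, target)
            if kappa >= delta_kappa:
                if alpha >= delta_alpha:
                    rules.append((dict(formula), alpha, kappa))
                else:
                    pending.append(formula)  # assumption: only rejected formulae are refined
        # L_{i+1}: conjunctions of pending formulae that do not reuse an attribute
        level = set()
        for f, g in combinations(pending, 2):
            merged = f | g
            if len({a for a, _ in merged}) == len(merged):
                level.add(merged)
    return rules

for condition, alpha, kappa in induce_rules("m.c.h.", 0.75, 0.75):
    print(condition, alpha, kappa)
```

With both thresholds set to 0.75 the sketch returns the three single-condition rules for m.c.h. listed in Section 3.1 (in no particular order).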

4 Focusing Mechanism 

One of the characteristics in medical reasoning is a focusing mechanism, which 
is used to select the final diagnosis from many candidates [8]. For example, in 
differential diagnosis of headache, more than 60 diseases will be checked by 

² It is notable that this rule is a kind of probabilistic proposition with two statistical
measures, which is one kind of extension of Ziarko’s variable precision
rough set model (VPRS) [11].
procedure Induction of Classification Rules;
var
  i : integer; M, L_i : List;
begin
  L_1 := L_er; /* L_er: List of Elementary Relations */
  i := 1; M := {};
  for i := 1 to n do /* n: Total number of attributes */
  begin
    while ( L_i <> {} ) do
    begin
      Sort L_i with respect to the value of coverage;
      Select one pair R = /\[a_i = v_j] from L_i,
        which has the largest value of coverage;
      L_i := L_i - {R};
      if ( kappa_R(D) >= delta_kappa )
      then do
        if ( alpha_R(D) >= delta_alpha )
          then do S_r := S_r + {R}; /* Include R as Classification Rule */
        M := M + {R};
    end
    L_{i+1} := (A list of the whole combination of the conjunction formulae in M);
  end
end {Induction of Classification Rules};



Fig. 1. An Algorithm for Classification Rules 



present history, physical examinations and laboratory examinations. In diagnos- 
tic procedures, a candidate is excluded if a symptom necessary to diagnose is 
not observed. 

This style of reasoning consists of the following two kinds of reasoning pro- 
cesses: exclusive reasoning and inclusive reasoning. First, exclusive reasoning ex- 
cludes a disease from candidates when a patient does not have a symptom which 
is necessary to diagnose that disease. Secondly, inclusive reasoning suspects a 
disease in the output of the exclusive process when a patient has symptoms spe- 
cific to a disease. These two steps are modeled as usage of two kinds of rules, 
negative rules (exclusive rules) and positive rules, the former of which corre- 
sponds to exclusive reasoning and the latter of which corresponds to inclusive 
reasoning. In the next two subsections, these two rules are represented as special 
kinds of probabilistic rules. 

4.1 Positive Rules 

A positive rule can be defined as a rule supported by only positive examples,
which means that the classification accuracy of the rule is equal to 1.0. Thus, a
positive rule is represented as:

$$R \rightarrow d \quad \text{s.t.} \quad R = \wedge_j [a_j = v_k], \quad \alpha_R(D) = 1.0.$$

In the above example, one positive rule of “m.c.h.” is:

$$[nau = 0] \rightarrow m.c.h. \quad \alpha = 3/3 = 1.0.$$

Such a positive rule is often called a deterministic rule. However, in this paper,
we use the term positive (deterministic) rules, because deterministic rules which
are supported only by negative examples, called negative rules, are introduced
in the next subsection.

4.2 Negative Rules 

Before defining a negative rule, let us first introduce an exclusive rule, the contrapositive
of a negative rule [8]. An exclusive rule can be defined as a rule including
all the positive examples, which means that the coverage of the rule is equal to
1.0.³ Thus, an exclusive rule is represented as:

$$R \rightarrow d \quad \text{s.t.} \quad R = \wedge_j [a_j = v_k], \quad \kappa_R(D) = 1.0.$$

In the above example, the exclusive rule of “m.c.h.” is:

$$[prod = 0] \wedge [nau = 0] \wedge [M1 = 1] \rightarrow m.c.h. \quad \kappa = 1.0.$$

It is notable that an exclusive rule corresponds to an upper approximation of a
target concept. For example, the set which supports the exclusive rule above is
an upper approximation of m.c.h.

From the viewpoint of propositional logic, an exclusive rule uniquely corresponds to

$$d \rightarrow \wedge_j [a_j = v_k],$$

because the condition of an exclusive rule corresponds to the necessity condition
of the conclusion $d$. Thus, it is easy to see that a negative rule is defined as the
contrapositive of an exclusive rule:

$$\vee_j \neg [a_j = v_k] \rightarrow \neg d,$$

which means that if a case fails to satisfy any one of the attribute-value pairs in the
condition of a negative rule, then we can exclude the decision $d$ from the candidates.
For example, the negative rule of m.c.h. is:

$$\neg[prod = 0] \vee \neg[nau = 0] \vee \neg[M1 = 1] \rightarrow \neg m.c.h.$$

In summary, a negative rule is defined as:

$$\vee_j \neg [a_j = v_k] \rightarrow \neg d \quad \text{s.t.} \quad \forall [a_j = v_k] \ \ \kappa_{[a_j = v_k]}(D) = 1.0,$$

where $D$ denotes the set of samples which belong to the class $d$.
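The construction of an exclusive rule and its use as a negative rule can be sketched as follows. The sketch collects every attribute-value pair whose coverage of the target class is 1.0 and then excludes the class for any case that violates at least one of them. The function names and the representation of a case as a dictionary are assumptions made for this illustration.

```python
ATTRS = ("age", "loc", "nat", "prod", "nau", "M1")
ROWS = [  # Table 1, one tuple per sample; the last field is the class
    ("50-59", "occ", "per", 0, 0, 1, "m.c.h."),
    ("40-49", "who", "per", 0, 0, 1, "m.c.h."),
    ("40-49", "lat", "thr", 1, 1, 0, "migra"),
    ("40-49", "who", "thr", 1, 1, 0, "migra"),
    ("40-49", "who", "rad", 0, 0, 1, "m.c.h."),
    ("50-59", "who", "per", 0, 1, 1, "psycho"),
]

def pair_coverage(attr, value, target):
    """Coverage of the single pair [attr = value] for the class `target`."""
    positives = [r for r in ROWS if r[-1] == target]
    hits = [r for r in positives if r[ATTRS.index(attr)] == value]
    return len(hits) / len(positives)

def exclusive_condition(target):
    """All attribute-value pairs shared by every positive example (coverage 1.0)."""
    pairs = {(a, r[i]) for r in ROWS for i, a in enumerate(ATTRS)}
    return sorted(p for p in pairs if pair_coverage(p[0], p[1], target) == 1.0)

def excluded(case, target):
    """Negative rule: exclude `target` if the case violates any pair of the condition."""
    return any(case.get(a) != v for a, v in exclusive_condition(target))

print(exclusive_condition("m.c.h."))  # [('M1', 1), ('nau', 0), ('prod', 0)]
print(excluded({"prod": 0, "nau": 1, "M1": 1}, "m.c.h."))  # True
```

The second call corresponds to sample 6 of Table 1: it has nausea, so the condition `[nau = 0]` fails and m.c.h. is excluded.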

³ Exclusive rules represent the necessity condition of a decision.







S. Tsumoto 



4.3 Rule Induction Algorithm 

An algorithm for induction of positive and negative rules is derived by simple
modification of the algorithm in Figure 1: if the thresholds of accuracy and
coverage are set to 0.0 and 1.0, respectively, the algorithm for negative rules will
be obtained. On the other hand, if the thresholds of accuracy and coverage are
set to 1.0 and 0.0, respectively, the algorithm for positive rules will be obtained.

It is notable that positive and negative rules can be extended to probabilistic
versions, which are discussed in detail in [8].



5 Criteria Tables 

5.1 Representation of Rules 

Another characteristic reasoning in medicine is m-of-n concepts, or a criteria
table, which is discussed in [2]. The criteria table for a disease d is described by
n attributes, which are enough to make its diagnosis. If at least m attributes are
observed in a patient, d should be suspected.

Langley discusses that this m-of-n description can be rewritten as a simple
linear combination of attribute-value pairs. Thus, he implements an induction
of this description as an induction of threshold concepts.

However, an m-of-n rule in medicine is not equivalent to a linear combination
rule, which is a special kind of statistical discriminant function [3]. Rather, this
type of rule is based on relations between sets as follows.

1. If all n attributes are observed, a disease d is suspected with the highest
accuracy. (The coverage is equal to 1.0.)

2. If m attributes are satisfied, a disease d should be suspected with high
accuracy. (The coverage is equal to 1.0.)

3. If fewer than m attributes are satisfied, the probability of d is low. However,
the coverage is equal to 1.0.

Thus, an m-of-n concept is described as a combination
of exclusive rules (below, we call them unit rules) with the constraint that their
accuracies are high:

$$R \rightarrow d \quad \text{s.t.} \quad R = \wedge_{j=1}^{i} [a_j = v_k] \ \ (m \leq i \leq n), \quad \alpha_R(D) \geq \delta_\alpha,$$

$$\forall [a_j = v_k], \ \ \kappa_{[a_j = v_k]}(D) = 1.0,$$

{^[aj=Vk]iU — ^2)7 

which also satisfies that: if $R$ is represented as $\wedge_{j=1}^{i} [a_j = v_k]$ $(i < m)$, then $\alpha_R(D) < \delta_\alpha$
holds.

For the above example in Table 1, the exclusive rule of m.c.h. is:

$$[prod = 0] \wedge [nau = 0] \wedge [M1 = 1] \rightarrow m.c.h. \quad \kappa = 1.0, \ \alpha = 1.0.$$

This attains the highest accuracy. If the threshold for accuracy is set to 0.75,
then

$[prod = 0] \rightarrow m.c.h.$  $\kappa = 1.0$, $\alpha = 0.75$,

$[nau = 0] \rightarrow m.c.h.$  $\kappa = 1.0$, $\alpha = 1.0$, and

$[M1 = 1] \rightarrow m.c.h.$  $\kappa = 1.0$, $\alpha = 0.75$.

So, diagnostic rules for m.c.h. can be viewed as a 1-of-3 concept. In this way, the
combination of accuracy and coverage is also important to represent m-of-n
type rules.
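Applying an m-of-n criteria table to a patient amounts to counting how many unit rules the patient satisfies. The following minimal sketch assumes that a case is given as a dictionary and that the unit rules are the three attribute-value pairs of the example; the names `UNIT_RULES`, `satisfied_count` and `suspect` are hypothetical.

```python
# Unit rules for m.c.h. taken from the example: each is a single attribute-value pair.
UNIT_RULES = [("prod", 0), ("nau", 0), ("M1", 1)]

def satisfied_count(case, unit_rules=UNIT_RULES):
    """Number of unit-rule conditions that hold for the case."""
    return sum(1 for attr, value in unit_rules if case.get(attr) == value)

def suspect(case, m, unit_rules=UNIT_RULES):
    """m-of-n test: suspect the disease if at least m of the n conditions hold."""
    return satisfied_count(case, unit_rules) >= m

sample_6 = {"prod": 0, "nau": 1, "M1": 1}   # psychological pain in Table 1
sample_3 = {"prod": 1, "nau": 1, "M1": 0}   # migraine in Table 1
print(satisfied_count(sample_6), suspect(sample_6, m=1))  # 2 True
print(satisfied_count(sample_3), suspect(sample_3, m=1))  # 0 False
```

Under the 1-of-3 reading, sample 6 is still suspected of m.c.h. although its true class differs, which reflects the accuracy of 0.75 of the unit rules `[prod = 0]` and `[M1 = 1]`.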



5.2 Rule Induction Algorithm 

An algorithm for induction of unit rules is derived by simple modification of the
algorithm in Figure 1: if the thresholds of accuracy and coverage are set to $\delta_\alpha$
and 1.0, respectively, then the algorithm for induction of each unit rule will be
obtained. In this model, we only need to add the integration of unit rules after rule
induction to obtain the total algorithm, which is not shown owing to space
limitations.

6 Conclusion 

In this paper, a rough set framework is introduced to model medical diagnostic
rules. The acquired models show that the characteristics of medical reasoning reflect
the concepts of approximation in rough sets, which explains why rough sets work
well in medical domains. These results have not yet been validated experimentally;
the validation will be reported in the near future.
Abstract. In this work we check how automatic discretization algorithms
generate decision rules for a concrete medical problem: diagnosing
mitochondrial encephalomyopathies (MEM).

We describe several algorithms, local and global, for the discretization of
continuous attributes obtained in the second stage of diagnosing MEM.
All of these algorithms act together with a data analysis method based
on rough set theory.

This work compares results (the quality of classification rules) which
were obtained using different discretization methods for the continuous
attributes.



1 Discretization of the continuous attributes in the 
decision systems 

Data describing the characteristics of real objects, which form the basis of
decision systems, are usually represented by real numbers. Consequently, such
systems contain continuous attributes, that is, attributes whose values
are real numbers from a certain range [a, b]. The more accurate the measurement
from which the values come, the larger the number of distinct values of a
continuous attribute. An information (decision) system
with continuous attributes is characterized by a great number of equivalence
classes (atomic sets); the number of equivalence classes is frequently of the order
of the number of objects, particularly when there are many continuous attributes
in the system.

The precise values of continuous attributes cause the formation of a great
number of equivalence classes, which results in a large number of decision
rules. A great number of decision rules makes it more likely that the rules will be
deterministic, but too many decision rules hinder
their verification by an expert. Additionally, such rules can classify new
cases poorly, because the attribute values of new objects may fail to match the
attribute values in the rules. For that reason it is necessary to
discretize the continuous attributes.

Discretization consists in dividing the range of a continuous attribute into a
certain number of intervals and assigning a distinct value to each of those
intervals. In this way an attribute with real values is transformed into an attribute
with discrete values. Let $A$ be a continuous attribute and let the interval $[a, b]$
be the range of its values. Then the following set of $k$ intervals
is called the partition $\pi_A$ of the interval $[a, b]$:

$$\pi_A = \{[a_0, a_1), [a_1, a_2), \ldots, [a_{k-1}, a_k]\}$$

where $a_0 = a$, $a_{i-1} < a_i$ for $i = 1, 2, \ldots, k$ and $a_k = b$.

Thus, discretization is a process which creates the set of intervals $\pi_A$
in the value range $[a, b]$ of the attribute $A$.

After the discretization of a continuous attribute, the values
of this attribute are transformed into the discrete values of the attribute $A^d$,
whose set of values is defined in the following way:

$$V_{A^d} = \{1, 2, \ldots, k\}$$

Those values are the interval numbers of the partition $\pi_A$.
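The mapping from a real value to its interval number can be sketched with a binary search over the boundary points. The function name `interval_number` and the boundary values used in the example are illustrative assumptions; the only property taken from the text is that all intervals are closed on the left and the last one is also closed on the right.

```python
from bisect import bisect_right

def interval_number(value, cuts):
    """Map a value to its interval number 1..k for boundaries cuts = [a_0, ..., a_k].

    Intervals are [a_0, a_1), [a_1, a_2), ..., [a_{k-1}, a_k]; only the last
    interval contains its right end point.
    """
    if not cuts[0] <= value <= cuts[-1]:
        raise ValueError("value lies outside the range [a, b]")
    if value == cuts[-1]:
        return len(cuts) - 1
    return bisect_right(cuts, value)

cuts = [0.0, 1.0, 2.0, 4.0]          # hypothetical partition with k = 3
print([interval_number(v, cuts) for v in (0.0, 0.5, 1.0, 3.9, 4.0)])  # [1, 1, 2, 3, 3]
```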

In the discretization process of the continuous attributes, a value called the
level of consistency is used to evaluate the consistency of the decision system
on the basis of rough set theory:

$$\frac{\sum_{X \in \{d\}^*} |\underline{A}X|}{|U|}$$

where:

- $A$ - a set of the system attributes

- $U$ - a set of the system objects

- $X$ - a set of objects connected with a given decision

- $\underline{A}X$ - the lower approximation of $X$ with respect to $A$
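The level of consistency can be computed by grouping objects into equivalence classes on the condition attributes and counting the objects whose class is uniform in the decision, since exactly those objects belong to the lower approximation of some decision class. The sketch below assumes that objects are dictionaries; the function name and the toy data are hypothetical.

```python
from collections import defaultdict

def level_of_consistency(objects, attributes, decision):
    """Sum of lower-approximation sizes over all decision classes, divided by |U|."""
    blocks = defaultdict(list)
    for obj in objects:
        # objects with identical values on all condition attributes are indiscernible
        blocks[tuple(obj[a] for a in attributes)].append(obj[decision])
    # a block lies inside a lower approximation iff all its objects share one decision
    consistent = sum(len(ds) for ds in blocks.values() if len(set(ds)) == 1)
    return consistent / len(objects)

toy = [
    {"a": 1, "d": "ill"},
    {"a": 1, "d": "healthy"},   # conflicts with the first object
    {"a": 2, "d": "ill"},
    {"a": 3, "d": "healthy"},
]
print(level_of_consistency(toy, ["a"], "d"))  # 0.5
```

A value of 1.0 means that the decision table is fully consistent, so no two indiscernible objects carry different decisions.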

The simplest discretization method divides the continuous
attribute values into two intervals, which yields two values of the discrete attribute ($|\pi_A| = 2$).
The case in which the number of intervals of the
discretized attribute is one is excluded from consideration, because an attribute having only one
value does not contribute any essential information to the system.

If discretization divides the values of a continuous attribute
into two intervals, and the objects of a decision system take m different values of this attribute,
there are m - 1 ways of dividing
those values. This number grows in geometric progression as
the number of intervals of the discretized attribute increases.

Since the number of possible division points in
a given set of continuous attribute values is so large, it is impossible to find
the best discretization by checking the results for all possible
divisions. For that reason different discretization
algorithms are used, each suggesting one way of division which gives results close
to the optimum.

The simplest algorithm divides the set of continuous attribute values into
intervals of equal width and is called the Equal Interval Width Method. This
algorithm does not need to consider the attribute values of particular objects, and
because of that its result depends mainly on the assumed number
of discrete attribute values and on the boundaries of the continuous attribute
value interval.
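A minimal sketch of the Equal Interval Width Method is given below. It needs only the range boundaries and the number of intervals, as stated above; the function name `equal_width_cuts` is an assumption of this example.

```python
def equal_width_cuts(a, b, k):
    """Return the k + 1 boundary points that split [a, b] into k intervals of equal width."""
    if k < 2:
        raise ValueError("at least two intervals are required")
    width = (b - a) / k
    # the last boundary is set to b exactly to avoid accumulated rounding error
    return [a + i * width for i in range(k)] + [b]

print(equal_width_cuts(0.0, 6.0, 3))  # [0.0, 2.0, 4.0, 6.0]
```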

Another algorithm, using so-called adaptive discretization [2], first divides
the interval [a, b] into two intervals of equal width
and then, using the discrete values, induces decision rules. The induced rules are
verified. If non-deterministic rules have been induced, or the consistency
level of the system is lower than assumed, one of those two intervals is divided
into two and the whole process of rule creation and verification
is repeated. This process is repeated until the assumed consistency level of the
decision system is reached.

The subsequent discretization algorithms use the notion of entropy, calculated
in the following way:

$$Ent(U) = -\sum_{j=1}^{k} p_j \log p_j$$

where the $p_j$ values are connected with the number of continuous attribute
values lying in the respective intervals.

The Equal Frequency per Interval Method (EF) is also called Maximum Entropy
Discretization [13]. An assigned number of intervals is created in the set of
continuous attribute values in such a way that
each interval contains approximately the same number of attribute
values for the chosen set of objects.
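The link between equal frequencies and maximal entropy can be illustrated with the following sketch. Two assumptions are made here that the text does not fix: each $p_j$ is taken to be the fraction of values falling into interval j, and boundaries are placed halfway between neighbouring observations. The logarithm base (2) and the function names are likewise choices of this example.

```python
import math

def equal_frequency_cuts(values, k):
    """Boundaries that put roughly len(values) / k observations into each of k intervals."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[0]]
    for j in range(1, k):
        idx = j * n // k
        # assumption: the boundary is the midpoint between two neighbouring observations
        cuts.append((ordered[idx - 1] + ordered[idx]) / 2)
    cuts.append(ordered[-1])
    return cuts

def entropy(counts):
    """Entropy of the interval occupancy, with p_j = counts[j] / sum(counts)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

print(equal_frequency_cuts([1, 2, 3, 4, 5, 6], 3))  # [1, 2.5, 4.5, 6]
print(entropy([2, 2, 2]) > entropy([4, 1, 1]))      # True
```

With six values and three intervals each interval holds two values, and this uniform occupancy has a higher entropy than any uneven occupancy of the same intervals.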

Another method, called the Minimal Class Entropy Method (MCE) [4], is based
on minimal entropy. Entropy is the criterion for searching for a list of the best
division points, which together with the boundaries of the continuous attribute
value interval form the partition being sought.
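One common reading of an entropy-driven search for a division point is sketched below: every midpoint between two neighbouring distinct values is tried, and the one that minimises the weighted class entropy of the two resulting groups is kept. This is an illustrative assumption about how such a search may be organised, not a transcription of the method in [4]; the function names are hypothetical.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Entropy of the decision classes within one group of objects."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (score, cut) for the midpoint with the lowest weighted class entropy."""
    pairs = sorted(zip(values, labels))
    best = None
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no boundary can separate two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * class_entropy(left)
                 + len(right) * class_entropy(right)) / len(pairs)
        if best is None or score < best[0]:
            best = (score, (lo + hi) / 2)
    return best

score, cut = best_cut([1, 2, 3, 5, 6], ["norm", "norm", "norm", "ill", "ill"])
print(cut)  # 4.0
```

Repeating the search inside the resulting intervals would yield a list of division points; how many are kept is a separate design decision (three intervals were used in the experiment reported below).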

The discretization methods presented so far are local methods: each acts
on one continuous attribute at a time. Global methods find partitions for all continuous
attributes at the same time, using tools connected with the method used to analyze
the decision system in question. A so-called globalization of local discretization
methods can also be carried out.

The Cluster Analysis Method (CA) [3] is an example of a global discretization
method. It begins with a cluster analysis, and then, when clustering
cannot be continued any longer (because the assumed level
of consistency has been reached), adjacent value intervals of particular continuous
attributes are merged. The entropy value is again taken as the criterion for choosing
the attribute and the intervals to merge.

In this work the following methods were used to discretize the continuous
attributes: EF, MCE and CA. The experimental results are presented in
the following sections.

2 Medical problem 

Our problem was to develop and describe a knowledge base for progressive
encephalopathy (PE). PE is a progressive loss of psychomotor and neuromuscular
functions occurring in infancy or in older children. The disease is grave and
life-threatening [1,8,12].

The underlying causes of PE are metabolic diseases. Metabolic processes in a cell
are catalyzed by enzymes and enzyme systems located in the nucleus and in
various cytoplasmic organelles such as lysosomes, mitochondria and peroxisomes.

In this work we have paid attention to encephalopathy in which the respiratory
enzymes of the cell located in the mitochondria are impaired. Mitochondrial
encephalomyopathies (MEM) occur with elevated levels of lactic and pyruvic acid
in the blood serum and the cerebrospinal fluid (CSF). MEM are a heterogeneous
group of disorders in which functional disturbances may concern many organs,
particularly the brain.

Detecting the disease requires a series of tests, some of which are
invasive and not without risk to a child’s health. Invasive tests are
divided into two groups: testing the levels of pyruvic and lactic acids in the blood
serum and the cerebrospinal fluid (CSF) in the first stage, and the examination of
a nerve or muscle segment to determine the enzyme levels, which are the final
tests confirming the disease. As the latter are the most threatening to a child’s health,
they are performed as a last resort on a small group of children.

In this way we create a three-stage classification in which the set of objects
(patients) becomes smaller at each classification level. The most important problem
is to create an appropriate classification system of patients at each level. This
consists of an appropriate choice of attributes for the classification process and
the generation of a set of rules, which is the basis for making decisions in new cases.

3 Description of data and neurological tests used in the 
system 

In the first stage we made an initial selection. Based on clinical symptoms we created
rules that classify the ill children into 2 groups: suspicion of PE or other
diseases [7].

The second stage consists of further elimination, from the first
group, of children who do not suffer from PE. This stage consists of taking blood
and CSF samples and then performing biochemical examinations of the pyruvate and lactate
levels in the samples.

Finally, on the basis of these results, rules qualifying patients to the suitable group
were created [9].
The clinical material consists of 114 patients aged between 3 months and 15
years (60 boys and 54 girls) suspected of MEM because of the
following 7 symptoms:

- lactate level in blood;

- pyruvate level in blood;

- ratio of lactate to pyruvate level in blood;

- lactate level in CSF;

- pyruvate level in CSF;

- ratio of lactate to pyruvate level in CSF;

- changes of acid levels in blood and CSF.

The values of the first six attributes were real. The last one takes one of three values.

In this work we describe the discretization process of the continuous
attributes obtained in the second stage of diagnosing MEM.

This work compares results (the quality of classification rules) which were
obtained using different discretization methods for the continuous attributes.

Table 1. Decision table before discretization - fragment

        Attribute
Object  1     2     3      4     5     6      7   Decision
1150    2.17  1.00  2.17   5.37  0.23  23.35  1   3
1160    0.55  0.07  7.86   1.24  0.09  13.78  1   2
1170    0.60  0.04  15.00  1.20  0.09  13.33  1   2
1190    0.87  0.48  1.81   0.91  0.24  3.79   1   1



In the second stage, for rule generation we used the machine learning program
LERS (Learning from Examples based on Rough Sets) [5,6]. The system handles
inconsistencies using rough set theory [10,11].



4 Experiment 

4.1 Discretization method 

During the second stage of diagnosing MEM we obtain six continuous attributes.
We used four discretization methods for the continuous attributes.



Discretization on the basis of a control group. By measuring
acid levels in blood and CSF in the control group, norms for particular attributes
were calculated.

Student’s t-test was used to calculate the norms.



Table 2. Table with norm in control group method

Attribute  1                2     3     4                5     6
           norm  pathology  norm  norm  norm  pathology  norm  norm
           1.96  2.5        0.43  33.1  2.47  3.0        0.3   24.1

Using these norms, the values of the real attributes were discretized.
Equal Frequency per Interval Method. In this method we set the
number of intervals to 3, in order to enable a comparison with the method based
on the control group.

The limit values for the six continuous attributes are given in the following table.



Table 3. The limit values in EF method

Attribute  Interval
1          (0, 1.07)    [1.07, 1.9)       [1.9, max_1]
2          (0, 1.2)     [1.2, 2.8)        [2.8, max_2]
3          (0, 5.625)   [5.625, 11.935)   [11.935, max_3]
4          (0, 0.71)    [0.71, 1.425)     [1.425, max_4]
5          (0, 0.065)   [0.065, 0.17)     [0.17, max_5]
6          (0, 3.045)   [3.045, 12.06)    [12.06, max_6]

where $max_i = \max\{a_i(x) : x \in U\}$



Minimal Class Entropy Method. As in the previous case, the number of
intervals was set to 3.

The limit values for the six continuous attributes are given in the following table.

Table 4. The limit values in MCE method

Attribute  Interval
1          (0, 1.335)   [1.335, 1.35)    [1.35, max_1]
2          (0, 1.95)    [1.95, 2.05)     [2.05, max_2]
3          (0, 8.315)   [8.315, 8.43)    [8.43, max_3]
4          (0, 1.16)    [1.16, 1.19)     [1.19, max_4]
5          (0, 0.085)   [0.085, 0.095)   [0.095, max_5]
6          (0, 7.265)   [7.265, 7.32)    [7.32, max_6]

where $max_i = \max\{a_i(x) : x \in U\}$





Cluster Analysis Method. The limit values for the six continuous
attributes are given in the following table:

Table 5. The limit values in CA method

Attribute  Begin of interval
1          0.36  2.02   3.78
2          0.02  0.15   0.17   0.21   0.34  0.43  2.30
3          0.88  12.18  69.33
4          0.0   2.54   8.6
5          0.0   0.29   0.82
6          0.0   11.43  20.75  45.26
4.2 Checking quality of rules

Rule sets induced from the training data by the LERS system were used to classify
new examples.

The training set consists of 114 examples.

The unseen set consists of 50 examples (new patients).

For these sets we use a naive approach to the classification of new examples,
in which an attempt is made to classify an example using all possible rules. In
the case of a bad result we will use the classification scheme of LERS (complete
and partial matching).

The table compares the results obtained by the methods described above.



Table 6. Results of four discretization methods

Naive classification scheme

Discretization  Correctly classified  Unclassified  Error
method          examples              examples      rate
Control group   32                    14            4/50 = 0.08
EF              20                    16            14/50 = 0.28
MCE             13                    22            15/50 = 0.30
CA              29                    14            7/50 = 0.14
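The error rates in Table 6 follow from the counts of correctly classified and unclassified examples among the 50 unseen cases, on the assumption that every remaining example was classified incorrectly. The short sketch below reproduces that arithmetic; the dictionary layout and the function name are choices of this example.

```python
TOTAL = 50  # size of the unseen set

# method -> (correctly classified, unclassified), as listed in Table 6
RESULTS = {
    "Control group": (32, 14),
    "EF": (20, 16),
    "MCE": (13, 22),
    "CA": (29, 14),
}

def error_rate(correct, unclassified, total=TOTAL):
    """Share of examples that were classified, but wrongly."""
    return (total - correct - unclassified) / total

for method, (correct, unclassified) in RESULTS.items():
    print(method, round(error_rate(correct, unclassified), 2))
```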



5 Conclusion 

This work compares results (the quality of classification rules) which were
obtained using different discretization methods for the continuous attributes
obtained in the second stage of diagnosing MEM.

The results obtained in this work clearly suggest that the method based on the
evaluation of norms on the basis of a control group leads to the best results. The method
based on the control group has the smallest error rate (0.08).

The error rates of the Equal Frequency per Interval Method (0.28) and the Minimal Class
Entropy Method (0.30) are much bigger than that of the control group method.

The Cluster Analysis method gives a comparable error rate (0.14) and the same
number of unclassified examples (14).

However, the division of continuous attributes into three or more intervals
met with objections and incomprehension from the physicians (experts).

The relatively big number of unclassified new examples in each method is
the result of incomplete data, i.e. missing attribute values.

These results are preliminary. The value of the discretization methods will be
finally confirmed when the number of new cases is comparable with the size of the
training set.
Abstract. This paper presents an approach to the design of approximate
time rough controllers. In this paper, the clocks used to measure
durations required to achieve controller objectives are modeled as
approximate time windows. An approximate time window (atw) partitions
time relative to granules (clumps of similar timing measurements) such
as early, ontime, late. An atw determines the degree of membership of
each observed duration in each of its temporal partitions. Based on
observations of the degree of overshoot, rise time, and settling time during the
operation of a control system, the architecture of an approximate time
rough control system is established. The rough controller is guided by
rules derived from a real-time decision-making system. Rough sets
theory is used to derive controller rules. Fuzzy sets theory is used to model
decision system sensors as fuzzy implications. A roughly fuzzy Petri net
model of an approximate controller is given. The approach taken in
approximate time rough control is illustrated relative to controlling the
pitch angle in an attitude control system for a small satellite.



1 Introduction 

Considerable work in control technology has been reported on a rapidly growing
range of problems being solved with a decision systems approach [1]. This
occurs especially in those niches of control engineering where the classic approaches
have been faced with some difficulties or have appeared to perform inherently
weakly. The objectives of this study are twofold. First, we introduce an approach
to the design of approximate real-time controllers which utilize a combination of
rough sets and fuzzy sets. Second, we illustrate the approximate real-time rough
control design methodology relative to an attitude control system for small
satellites. Experiments with this form of rough control have been carried out in the
design of an attitude control system for a small, scientific satellite. This research
has been carried out over the past two years by a research group consisting of two
space systems engineers from a local aerospace industry as well as two professors
and several graduate students in the faculty of engineering at the University of
Manitoba working in cooperation with the Canadian Space Agency.




Fig. 1 Orbit of Satellite    Fig. 2 Structure of Satellite

For simplicity, the discussion is limited to a linearized pitch control strategy 
for nadir-pointing, momentum bias satellites (see Figs. 1 and 2) using sensors, 
momentum wheel, and torquers to achieve attitude control [2]. The focus of this 
paper is a description of how rough control rules derived from a real-time deci- 
sion system table have been used in fine-pointing for attitude control of a small 
satellite. The contribution of this paper is the application of rough sets, fuzzy 
sets and approximate time windows in the design of approximate time rough 
control systems. 



2 Knowledge Discovery Process 

The problem in this research is instrumenting sensors to obtain knowledge of
the interdependencies of controller overshoot ov, rise time rt and settling time
st to achieve fine-pointing in selected classes of controllers. More deeply, the
selection of the “mechanism” which defines the operation of a sensor is a
non-trivial task, and depends on engineering skills, intuition, and an understanding
of the behavior of a controller. It is our knowledge of the “usual” distribution of
controller rise time and settling time which underlies design choices concerning
sensors.

2.1 Approach

In the context of controllers, the design of an approximate-time rough control
system is accomplished by (a) selecting a universe consisting of objects which
are observations of overshoot as well as of the durations st (settling time) and rt (rise
time) and (b) instrumenting what is known as an approximate time window with
attributes identified with sensors used to evaluate durations as well as sensors to
evaluate overshoot. Overshoot (ov) is the biggest deviation of the step response from
a particular steady state after the step response has reached a tolerance band for
the first time. Rise time rt is the time when a step response reaches 90% of its
steady-state value for the first time, and settling time st is measured relative to
rise time (i.e., the clock for st is reset at t = rt). Time itself is measured relative
to a clock which measures durations in the context of fuzzy sets named early,
ontime, and late.
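The three quantities defined above can be extracted from a sampled step response as sketched below. The sketch assumes a positive steady-state value, a response given as two equally long lists of samples, and a symmetric tolerance band of half-width `tol`; the function name `step_metrics` and the sample data are hypothetical.

```python
def step_metrics(times, response, steady_state, step_height, tol):
    """Return (rise time, relative overshoot, settling time measured from the rise time)."""
    # rise time: first sample at which the response reaches 90% of the steady state
    rise_idx = next(i for i, y in enumerate(response) if y >= 0.9 * steady_state)
    # first entry into the tolerance band around the steady state
    band_idx = next(i for i, y in enumerate(response) if abs(y - steady_state) <= tol)
    # overshoot: biggest deviation after that entry, divided by the step height
    overshoot = max(abs(y - steady_state) for y in response[band_idx:]) / step_height
    # settling: first sample after the last excursion outside the band
    outside = [i for i, y in enumerate(response) if abs(y - steady_state) > tol]
    settle_idx = outside[-1] + 1 if outside else 0
    if settle_idx >= len(times):
        raise ValueError("response has not settled within the record")
    rise_time = times[rise_idx]
    # the clock for the settling time is reset at t = rise time
    settling_time = max(times[settle_idx] - rise_time, 0)
    return rise_time, overshoot, settling_time

times = [0, 1, 2, 3, 4, 5, 6]
response = [0.0, 0.5, 0.96, 1.2, 1.04, 0.98, 1.0]
rt, ov, st = step_metrics(times, response, steady_state=1.0, step_height=1.0, tol=0.05)
print(rt, round(ov, 2), st)  # 2 0.2 2
```

In this example the response first reaches 90% of the steady state at t = 2, peaks at 1.2 (a relative overshoot of 0.2) and stays inside the band from t = 4 onwards, so the settling time measured from the rise time is 2.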



Fig. 3 Architecture of Approximate Time Rough Control System

In other words, based on observed rise time and settling time, approximate
time windows (designated ATW1 and ATW2 in Fig. 3) are designed relative to
fuzzy partitions of temporal intervals. Approximate time window measurements
are defined relative to durations between firings of transitions in roughly fuzzy
Petri nets, which were introduced in [3,6]. Approximate time windows are an
extension of the fuzzy clock model introduced in [4]. Overshoot is conceptualized
in terms of fuzzy sets relative to linguistic labels acceptable, big, and very big.



2.2 Background Knowledge 

The rough sets approach to decision systems, especially in the context of real- 
time decision-making and the representation of decisions with Petri nets, has 
been investigated in [5,6]. In deriving decision system rules, the discernibility 
matrix and discernibility function are essential. Precise conditions for decision 
rules can be extracted from a discernibility matrix. The application of rough sets 
theory in control systems has been investigated by [1]. 

3 Controller Example 

As a real-time application of the approximate time rough control methodology,
we consider the attitude control of a satellite. A high-level block diagram showing
the principal components of the rough control system is shown in Fig. 4.

Fig. 4 Rough Control System

For the sake of simplicity, we consider only one degree of freedom. The control
methodology is illustrated relative to the control of a satellite pitch angle
$\theta$ (see Fig. 2), which is decoupled from satellite yaw and roll angle control.
Using a PD pitch controller, the closed-loop system response is given by:

$$\theta = \frac{K_p + K_d s}{J s^2 + K_d s + K_p}\, r + \frac{1}{J s^2 + K_d s + K_p}\, d$$

where $\theta$ is the pitch angular position; r, setpoint; d, disturbance; J, moment
of inertia of the plant; and Kd and Kp are the controller differential and
proportional gain parameters. To build an information system, we define the
following features.

— Overshoot ov: The biggest deviation of the step response from steady state
after the step response has reached the tolerance band G for the first time.
Overshoot is divided by the height of the step demand to obtain a relative
quantity.

— Rise time rt: The time when the step response reaches 90% of its steady-state
value for the first time.

— Settling time st: The time such that, for t > st, deviations from the steady
state remain within the tolerance band ± G.
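The three features can be extracted from a sampled step response along the following lines. This is a minimal sketch under stated assumptions, not the authors' instrumentation: the function name `step_features`, the sampled lists `t` and `y`, and the parameter `band` (the half-width of the tolerance band) are hypothetical names introduced here, and a positive steady-state value is assumed.

```python
def step_features(t, y, steady, step_height, band):
    """Return (ov, rt, st) for a sampled step response.

    t, y        -- equal-length lists of sample times and response values
    steady      -- steady-state value (assumed positive in this sketch)
    step_height -- height of the step demand, used to make overshoot relative
    band        -- half-width of the tolerance band around the steady state
    """
    # Rise time: first sample at which the response reaches 90% of steady state.
    rise_idx = next(i for i, v in enumerate(y) if v >= 0.9 * steady)
    rt = t[rise_idx]

    # First sample inside the tolerance band.
    enter_idx = next(i for i, v in enumerate(y) if abs(v - steady) <= band)

    # Overshoot: biggest deviation after first entering the band,
    # divided by the height of the step demand.
    ov = max(abs(v - steady) for v in y[enter_idx:]) / step_height

    # Settling: the sample right after the last excursion outside the band.
    # If the last sample is still outside, the final time is returned.
    last_out = max((i for i, v in enumerate(y) if abs(v - steady) > band),
                   default=-1)
    settle_abs = t[min(last_out + 1, len(t) - 1)]

    # The clock for st is reset at t = rt.
    st = settle_abs - rt
    return ov, rt, st


t = [0, 1, 2, 3, 4, 5, 6]
y = [0.0, 0.5, 0.96, 1.2, 1.04, 0.99, 1.0]
ov, rt, st = step_features(t, y, steady=1.0, step_height=1.0, band=0.05)
```

In this example the response first reaches 90% of steady state at t = 2, peaks at 1.2, and stays inside the band from t = 4 onward, so the settling time relative to the rise time is 2.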

Here we measure the settling time relative to the rise time, i.e. the clock for st
is reset at t = rt. Our set of objects consists of the step responses of the system
for different controller gains. For each object (observed step response of the
system), we decide on correction factors for both proportional and derivative
gain, which change the controller parameters in order to improve the performance.
It should also be noted that changes in controller parameter values (Kd and Kp)
in a decision table are inserted by an experienced control engineer using a form
of pattern recognition. Each decision value inducing changes in Kd and Kp is
a judgment about controller performance from a measured step response: the
observed response is compared to the picture of an ideal response. A decision
table is constructed with nine condition attributes: a1, a2, a3 for granulations
of overshoot measurement, a4, a5, a6 for rise time granulations, and a7, a8,
a9 for settling time granulations. Sample rows from a rough controller tuning
information table are given in Table 1.







X    {ov, rt, st}          a1    a2    a3    a4    a5    a6    a7    a8    a9    d1    d2
x1   {3.5, 4.80, 2.60}     0.9   0.1   0.0   0.60  1.00  0.93  0.62  0.86  0.90  2.00  1.30
x2   {0.50, 3.10, 4.40}    1.0   0.0   0.0   0.60  1.00  1.00  0.81  1.00  0.90  2.00  1.50
x3   {17.50, 1.50, 4.80}   0.0   0.0   0.8   0.60  0.96  0.92  0.60  0.80  0.91  1.80  2.00
x4   {0, 2.10, 1.10}       1.0   0.0   0.0   0.60  1.00  0.96  0.91  1.00  0.90  1.50  1.20
x6   {4.50, 1.10, 1.60}    0.7   0.3   0.0   0.72  0.90  0.90  0.76  1.00  0.90  0.80  1.20

Table 1 Decision Table (s)

The distribution of degree-of-membership values in a granule associated with
a sensor $a_i$, $1 \le i \le 9$, is assumed to be approximately normal (Gaussian),
with mean (modal point) m and standard deviation s (spread). Let g, x be the
name of a granule associated with sensor $a_i$ and a measurement (e.g. overshoot
at a given instant in time), respectively. Hence, the membership function
used in modeling sensor $a_i$ is given by

g{x) = exp j 
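A Gaussian granule with modal point m and spread s can be sketched as follows. The exact normalization of the exponent is not recoverable from the text above, so the common form exp(-(x - m)^2 / (2 s^2)) is used here purely as an assumption; the function name `gaussian_membership` is likewise hypothetical.

```python
import math


def gaussian_membership(x, m, s):
    """Degree of membership of measurement x in a granule with modal
    point m and spread s.

    ASSUMPTION: the exponent is -(x - m)^2 / (2 s^2); the source formula
    may use a different scaling of the spread.
    """
    return math.exp(-((x - m) ** 2) / (2.0 * s ** 2))


# Membership is 1 at the modal point and decays symmetrically around it.
at_mode = gaussian_membership(3.0, 3.0, 1.5)
left = gaussian_membership(2.0, 3.0, 1.5)
right = gaussian_membership(4.0, 3.0, 1.5)
far = gaussian_membership(10.0, 3.0, 1.5)
```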

In modeling sensors a4, a5, a6 (rise time sensors) and a7, a8, a9 (settling time
sensors), we also introduce a modulator r and strength-of-connection w. Taken
collectively, each trio of sensors constitutes an atw (approximate time window).
A modulator imposes a threshold on stimuli, and a strength-of-connection raises
or lowers the impact of an input in an atw. Then sensor $a_i$ in Table 1 is modeled
as an aggregation of a fuzzy implication value and strength-of-connection w:

$a_i(x) = (r \to g(x))\; s\; w$, where $(r \to g(x)) = \min(1, \ldots)$
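The aggregation of an implication value with a strength-of-connection can be sketched as below. The probabilistic sum a + b - ab is the standard s-norm of that name. The implication operator, however, is truncated in the text, so the Goguen implication min(1, g/r) is substituted here as an explicit assumption; all function names are hypothetical.

```python
def goguen_implication(r, g):
    """ASSUMPTION: r -> g = min(1, g / r), taken as 1 when r == 0.
    The source formula is cut off after 'min(1,', so this choice is a guess."""
    return 1.0 if r == 0 else min(1.0, g / r)


def probabilistic_sum(a, b):
    """s-norm used for aggregation: a s b = a + b - a*b."""
    return a + b - a * b


def sensor_value(g, r, w, implication=goguen_implication):
    """Sensor output (r -> g(x)) s w for membership degree g = g(x),
    modulator r and strength-of-connection w."""
    return probabilistic_sum(implication(r, g), w)


# A stimulus above the threshold r saturates the implication at 1.
saturated = sensor_value(g=0.8, r=0.4, w=0.0)
# A stimulus below the threshold is scaled, then raised by w.
raised = sensor_value(g=0.2, r=0.4, w=0.5)
```

Passing the implication as a parameter keeps the sketch usable if a different operator turns out to be the intended one.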

In this research, the operator s (s-norm operator) computes a probabilistic sum.
It has been shown that the modulator and strength-of-connection parameters
in approximate time windows can be calibrated [3]. Table 1 is processed as two
separate decision tables in Rosetta, one to derive rules for changing proportional
gain ($d_1 = v_{K_p}$) and a second table to derive rules for changing differential
gain ($d_2 = v_{K_d}$). It should also be noted that each application of a rule relative
to an observed step response of the control system results in changes in both Kp
and Kd. The reducts {as, as, ae, ay}, {as, ay, ag} were derived with Rosetta. A 
sampling of the controller rules is given as follows: 

Kp : [a3(0.0) AND a5(1.00) AND a6(1.00) AND a7(0.81)] OR
[a3(0.0) AND a5(1.00) AND a6(0.93) AND a7(0.62)] => d1(2.00)

Kd : [a2(0.1) AND a5(1.00) AND a7(0.81)] => d2(1.80)

Such rules are derived from a real-time decision system table based on a suf- 
ficient number of prototypical experimental measurements of controller perfor- 
mance and the granulation of these measurements. 

Rule Firing Algorithm 

step 1. Let $x$, $a_i$, $a_j$, $a_k$, $v_{a_i}$, $v_{a_j}$, $v_{a_k}$ be an experimental
value observed during actual operation of a control system, sample decision system
condition sensors for a sample control rule r ∈ D(S), and sensor values from the
decision system table (U, A ∪ {d}, V), respectively. Let s be defined as the sum

$s = (a_i(x) - v_{a_i}) + (a_j(x) - v_{a_j}) + (a_k(x) - v_{a_k})$

where x is an input value (e.g. observed overshoot, rise time, or settling time)
evaluated with sensor $a_i$ in A (for example) to produce a particular value $v_{a_i}$.

step 2. Let n, m be the number of $K_p$, $K_d$ rules, respectively. Let $s_i$,
$1 \le i \le n$, and $s_j$, $1 \le j \le m$, be sums of the form introduced in step 1
relative to the n rules for $K_p$ and the m rules for $K_d$, respectively. Then let
$m_{K_p}$, $m_{K_d}$ be functions defined as follows:

$m_{K_p}: s_1, \ldots, s_i, \ldots, s_n \mapsto i$ such that $s[i] = \min(s_1, \ldots, s_i, \ldots, s_n)$
$m_{K_d}: s_1, \ldots, s_j, \ldots, s_m \mapsto j$ such that $s[j] = \min(s_1, \ldots, s_j, \ldots, s_m)$

In other words, $m_{K_p}$, $m_{K_d}$ each finds the index of the smallest sum, which
identifies the premise of a rule which is closest to the measured condition during
the operation of a controller.

step 3. Let $K_p$, $K_d$ be the current values of proportional and differential
coefficients. Then compute

$K_p := K_p * d[i]$ and $K_d := K_d * d[j]$
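The three steps can be sketched as a nearest-rule lookup followed by a multiplicative gain update. This is an illustrative sketch only. Two assumptions are made loudly: (1) squared differences are summed, because signed differences as printed in step 1 could cancel, and (2) the rule lists below are hypothetical, assembled from rows of Table 1 and from the sample rule for Kd, and are not the rule base derived with Rosetta.

```python
def rule_distance(measured, premise):
    """Step 1 sum for one rule.
    ASSUMPTION: differences are squared so that deviations cannot cancel."""
    return sum((measured[a] - v) ** 2 for a, v in premise.items())


def closest_rule(measured, rules):
    """Step 2: index of the rule whose premise gives the smallest sum."""
    sums = [rule_distance(measured, premise) for premise, _ in rules]
    return min(range(len(sums)), key=sums.__getitem__)


def fire(measured, kp, kd, kp_rules, kd_rules):
    """Step 3: scale both gains by the decisions of the closest rules."""
    i = closest_rule(measured, kp_rules)
    j = closest_rule(measured, kd_rules)
    return kp * kp_rules[i][1], kd * kd_rules[j][1]


# Hypothetical rule base: (premise over sensor values, correction factor).
KP_RULES = [
    ({"a3": 0.0, "a5": 1.00, "a6": 1.00, "a7": 0.81}, 2.00),
    ({"a3": 0.8, "a5": 0.96, "a6": 0.92, "a7": 0.60}, 1.80),
    ({"a3": 0.0, "a5": 0.90, "a6": 0.90, "a7": 0.76}, 0.80),
]
KD_RULES = [
    ({"a2": 0.1, "a5": 1.00, "a7": 0.81}, 1.80),
    ({"a2": 0.3, "a5": 0.90, "a7": 0.76}, 1.20),
]

measured = {"a2": 0.3, "a3": 0.0, "a5": 0.9, "a6": 0.9, "a7": 0.76}
new_kp, new_kd = fire(measured, 10.0, 5.0, KP_RULES, KD_RULES)
```

Here the measured sensor values coincide with the third Kp premise and the second Kd premise, so Kp is scaled by 0.80 and Kd by 1.20.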

At this point, it should be observed that variations of step 3 of the rule-firing
algorithm are possible. First, it has been found that it is helpful to change Kp
only if the percent of overshoot exceeds some k (e.g. k = 0.1). Second, performance
of the controller can be improved by forming {derived rules} ∪ {default rules},
constituting a population which evolves. Such refinements of this algorithm are
outside the scope of this paper.




Fig. 5 Closed-loop Response to Rectangular Disturbance 




Fig. 6 Closed-loop response during self-tuning 

Kp and Kd correction rules for a PD controller have been derived using a
combination of a roughly fuzzy Petri net and Rosetta. Tuning information from a
number of controller simulations was used to build an information system and to
generate some tuning decisions. These rules can be used to tune a satellite pitch
controller on-line. The new information collected after each tuning was added
to the rough control system, and dynamic reducts were employed to modify
decision rules periodically. A comparison of the performance of rough control with
a PD control of the pitch angle in the presence of a rectangular (square wave
form of) disturbance is shown in Figs. 5 and 6. Fine-pointing is achieved rapidly.
Each step response of the rough controller is due to the firing of a pair of rules
(see Fig. 4) used to select appropriate changes in proportional and differential
gains of the PD controller. This approach differs from classical PD control, since
the gains are changing dynamically depending on the degree of disturbance.
Other forms of disturbance have also been investigated with similar fine-pointing
results. Further, good results have been achieved in cases where the system is
underdamped or overdamped.

4 Concluding Remarks 

The application of real-time decision-making in selecting gains for a satellite
attitude control system has been presented. The approach described in this paper
has been applied in a variety of control systems (hydraulic servo system,
flood water diversion control system, temperature control system for a vertical
mixer for preparing solid propellant for rocket engines), in a software quality
control system, and in control functions for interacting robots in a computer zoo.
Considerable work still needs to be done in improving sensor models, approximate
time decision-making, and approximate time window models.

Abstract. Relationships between parameters of a decision rule system and the
minimal depth of a decision tree which solves the problem of searching for all
realizable rules from the system are considered. Unimprovable upper and close
to unimprovable lower bounds on the minimal depth of a decision tree are
obtained.



1 Introduction 

Decision trees and decision rule systems are widely used in different applications 
as algorithms and as a form of knowledge representation. Problems of compara- 
tive analysis of decision trees and decision rule systems are interesting both for 
theory and for practice [2, 6]. 

In this paper we consider relationships between parameters of a decision rule
system and the minimal depth of a decision tree which solves the problem of
searching for all realizable rules from the system. The necessity to find all
realizable rules arises, for example, if we consider problems which can
simultaneously have many decisions, and the number of realizable rules with the
same decision characterizes the importance of this decision. The main question
considered in the paper is whether a decision tree can restrict itself to
recognizing the values of only a part of the attributes from a decision rule system.

As the parameters of a decision rule system we consider the number of different
attributes in the system, the maximal number of values of an attribute and
the maximal length of a rule. Unimprovable upper and close to unimprovable
lower bounds on the minimal depth of a decision tree are obtained. The main
consequences of these results are the following: in the worst case, to solve the
considered problem for a decision rule system by a decision tree we must
recognize the values of all attributes from the system. However, there exist systems
such that, to solve the considered problem by a decision tree, it is sufficient to
recognize the values of only a small part of the attributes from the system.

In proofs we use methods of test theory [1, 3, 4, 5] and rough set theory [7,

2 Main Definitions and Results

2.1 Decision Rule Systems 

Let $\omega = \{0, 1, 2, \ldots\}$ and $A = \{a_i : i \in \omega\}$. Elements of the set $A$
will be called attributes. A decision rule is an expression of the kind

$$a_{i_1} = \delta_1 \wedge \ldots \wedge a_{i_m} = \delta_m \Rightarrow \sigma$$

where $m \in \omega$, $a_{i_1}, \ldots, a_{i_m}$ are pairwise different attributes from $A$ and
$\delta_1, \ldots, \delta_m, \sigma \in \omega$. Denote this decision rule by $r$. The expression
$a_{i_1} = \delta_1 \wedge \ldots \wedge a_{i_m} = \delta_m$ will be called the left part of the rule $r$.
The number $m$ will be called the length of the rule $r$. Denote
$A(r) = \{a_{i_1}, \ldots, a_{i_m}\}$ and $K(r) = \{a_{i_1} = \delta_1, \ldots, a_{i_m} = \delta_m\}$
(if $m = 0$ then $A(r) = K(r) = \emptyset$).

A decision rule system $S$ is a finite nonempty set of decision rules. Let
$A(S) = \bigcup_{r \in S} A(r)$, $n(S) = |A(S)|$ and let $d(S)$ be the maximal length of a
decision rule from $S$. For $a_i \in A(S)$ let
$V_S(a_i) = \{\delta : \delta \in \omega, (a_i = \delta) \in \bigcup_{r \in S} K(r)\}$. Denote
$k(S) = \max\{|V_S(a_i)| : a_i \in A(S)\}$. Denote by $\mathcal{D}$ the set of all decision
rule systems.

Let $S \in \mathcal{D}$ and $A(S) = \{a_{j_1}, \ldots, a_{j_n}\}$ where $j_1 < \ldots < j_n$. Denote
$V(S) = V_S(a_{j_1}) \times \ldots \times V_S(a_{j_n})$. For $\delta = (\delta_1, \ldots, \delta_n) \in V(S)$
denote $K(S, \delta) = \{a_{j_1} = \delta_1, \ldots, a_{j_n} = \delta_n\}$. We will say that a decision
rule $r \in S$ is realized for the tuple $\delta$ if $K(r) \subseteq K(S, \delta)$.

We define the problem All Realizable Rules for a system $S \in \mathcal{D}$ as follows: for
a given tuple $\delta \in V(S)$ it is required to find all rules from $S$ which are realized
for the tuple $\delta$. Denote this problem by ARR(S).
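The definitions of this subsection can be made concrete with a small sketch. The representation is an assumption chosen for illustration: a rule is a pair `(left_part, decision)` where `left_part` maps an attribute index to its required value, and all function names are hypothetical.

```python
from itertools import product


def attributes(S):
    """A(S): sorted indices of all attributes occurring in the system."""
    return sorted({a for left, _ in S for a in left})


def values(S, a):
    """V_S(a): values of attribute a that occur in rules of S."""
    return sorted({left[a] for left, _ in S if a in left})


def n_of(S):
    """n(S): number of different attributes."""
    return len(attributes(S))


def d_of(S):
    """d(S): maximal length of a rule."""
    return max(len(left) for left, _ in S)


def k_of(S):
    """k(S): maximal number of values of an attribute."""
    return max((len(values(S, a)) for a in attributes(S)), default=0)


def tuples(S):
    """Enumerate the tuples of value assignments over A(S)."""
    attrs = attributes(S)
    for vals in product(*(values(S, a) for a in attrs)):
        yield dict(zip(attrs, vals))


def realized(S, delta):
    """ARR(S): all rules whose left part is contained in the tuple delta."""
    return [(left, dec) for left, dec in S
            if all(delta[a] == v for a, v in left.items())]


# Example system: a1=0 & a2=0 => 0,  a3=0 => 3,  a1=1 => 4
S = [({1: 0, 2: 0}, 0), ({3: 0}, 3), ({1: 1}, 4)]
params = (n_of(S), d_of(S), k_of(S))
found = [dec for _, dec in realized(S, {1: 0, 2: 0, 3: 0})]
```

For the example system the parameters are n = 3, d = 2, k = 2, and the rules with decisions 0 and 3 are realized for the all-zero tuple.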

2.2 Decision Trees 

A finite oriented tree with the root is a finite oriented tree containing exactly one
node with no entering edges. This singular node is called the root. The nodes
of the tree having no issuing edges are called terminal nodes. The non-terminal
nodes of the tree are called working nodes. A complete path of a finite oriented
tree with the root is a sequence $\xi = v_1, d_1, \ldots, v_m, d_m, v_{m+1}$ of nodes and edges
of the tree such that $v_1$ is the root, $v_{m+1}$ is a terminal node and for $i = 1, \ldots, m$
the edge $d_i$ issues from the node $v_i$ and enters the node $v_{i+1}$.

Let $S$ be a decision rule system. A decision tree over $S$ is a labeled finite
oriented tree with the root which satisfies the following conditions:

a) every working node is labeled with an attribute from $A(S)$;

b) a working node which is labeled with an attribute $a_i$ has $|V_S(a_i)|$ issuing
edges, and these edges are labeled with pairwise different elements from $V_S(a_i)$;

c) every terminal node is labeled with a subset of the set $S$.

Let $\Gamma$ be a decision tree over $S$. Denote by $CP(\Gamma)$ the set of all complete
paths of $\Gamma$. Let $\xi = v_1, d_1, \ldots, v_m, d_m, v_{m+1}$ be a complete path of $\Gamma$. We will
associate with $\xi$ a set of attributes $A(\xi)$ and a system of equations $K(\xi)$. If
$m = 0$ then $A(\xi) = \emptyset$ and $K(\xi) = \emptyset$. Let $m > 0$ and for $j = 1, \ldots, m$ let the
node $v_j$ be labeled with the attribute $a_{i_j}$, and the edge $d_j$ be labeled with the
number $\delta_j$. Then $A(\xi) = \{a_{i_1}, \ldots, a_{i_m}\}$ and
$K(\xi) = \{a_{i_1} = \delta_1, \ldots, a_{i_m} = \delta_m\}$. Denote by $\tau(\xi)$ the set of decision
rules which is the label of the node $v_{m+1}$.

A system of equations $\{a_{i_1} = \delta_1, \ldots, a_{i_m} = \delta_m\}$, where
$a_{i_1}, \ldots, a_{i_m} \in A$ and $\delta_1, \ldots, \delta_m \in \omega$, will be called inconsistent if there
exist $l, k \in \{1, \ldots, m\}$ such that $l \ne k$, $i_l = i_k$ and $\delta_l \ne \delta_k$. If a system
of equations is not inconsistent then it will be called consistent.

Let $S$ be a decision rule system and $\Gamma$ be a decision tree over $S$. We will say
that $\Gamma$ solves the problem ARR(S) if for each path $\xi \in CP(\Gamma)$ with consistent
system of equations $K(\xi)$ the following conditions hold:

a) $K(r) \subseteq K(\xi)$ for each rule $r \in \tau(\xi)$;

b) for any rule $r \in S \setminus \tau(\xi)$ the system of equations $K(r) \cup K(\xi)$ is inconsistent.

For an arbitrary complete path $\xi \in CP(\Gamma)$ denote by $h(\xi)$ the number of
working nodes in the path $\xi$. The value

$$h(\Gamma) = \max\{h(\xi) : \xi \in CP(\Gamma)\}$$

will be called the depth of the decision tree $\Gamma$.

Denote by $h(S)$ the minimal depth of a decision tree over $S$ which solves the
problem ARR(S).
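For very small systems the minimal depth can be found by exhaustive search. The sketch below is not the method of the paper; it is a brute-force illustration built on one reading of the definition: a path may end as soon as every rule is either contained in the equations collected so far or contradicted by them. It also assumes that no attribute is asked twice on a path, which should not increase the depth. The rule representation (`left_part`, `decision`) and the name `min_depth` are hypothetical.

```python
from functools import lru_cache


def min_depth(S):
    """Brute-force h(S) for a small rule system S (exponential time)."""
    rules = [tuple(sorted(left.items())) for left, _ in S]
    attrs = sorted({a for r in rules for a, _ in r})
    vals = {a: sorted({v for r in rules for b, v in r if b == a})
            for a in attrs}

    def settled(assign):
        # A terminal node is admissible when each rule is either contained
        # in the path's equations or inconsistent with them.
        for r in rules:
            contained = all(assign.get(a) == v for a, v in r)
            contradicted = any(a in assign and assign[a] != v for a, v in r)
            if not (contained or contradicted):
                return False
        return True

    @lru_cache(maxsize=None)
    def h(state):
        assign = dict(state)
        if settled(assign):
            return 0
        best = None
        for a in attrs:
            if a in assign:
                continue
            # A working node labeled a has one edge per value in V_S(a).
            worst = max(h(tuple(sorted(state + ((a, v),)))) for v in vals[a])
            best = worst + 1 if best is None else min(best, worst + 1)
        return best

    return h(())


# a1=0 & a2=0 => 0,  a3=0 => 3,  a1=1 => 4: all three attributes are needed.
worst_case = min_depth([({1: 0, 2: 0}, 0), ({3: 0}, 3), ({1: 1}, 4)])
# a1=0 & a2=0 => 0,  a1=1 & a3=0 => 1: a1 selects which attribute to ask next.
easy_case = min_depth([({1: 0, 2: 0}, 0), ({1: 1, 3: 0}, 1)])
```

Both example systems have n = 3, d = 2 and k = 2, yet the first needs depth 3 and the second only depth 2, which illustrates the gap between the worst case and the best case discussed in the paper.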

2.3 Main Results 

One can show that $\{(n(S), d(S), k(S)) : S \in \mathcal{D}\} = \{(0, 0, 0)\} \cup \{(n, d, k) :
n, d, k \in \omega \setminus \{0\}, d \le n\}$.

Let $n, d, k \in \omega \setminus \{0\}$ and $d \le n$. Denote

$$h(n, d, k) = \min\{h(S) : S \in \mathcal{D}, n(S) = n, d(S) = d, k(S) = k\},$$

$$H(n, d, k) = \max\{h(S) : S \in \mathcal{D}, n(S) = n, d(S) = d, k(S) = k\}.$$

The considered values are the unimprovable lower (the value $h(n, d, k)$) and the
unimprovable upper (the value $H(n, d, k)$) bounds on minimal decision tree depth
for systems $S \in \mathcal{D}$ such that $n(S) = n$, $d(S) = d$ and $k(S) = k$.

Theorem 1. Let $n, d, k \in \omega \setminus \{0\}$ and $d \le n$. Then $H(n, d, k) = n$.

This theorem shows that in the worst case, to solve a problem ARR(S) by
a decision tree we must recognize the values of all attributes from the set $A(S)$.

Theorem 2. Let $n, d, k \in \omega \setminus \{0\}$ and $d \le n$. If $k = 1$ or $d = 1$ then
$h(n, d, k) = n$. If $k \ge 2$ and $d \ge 2$ then

$$\max\left\{d, \frac{n(k - 1)}{k^d}\right\} \le h(n, d, k) \le d + \frac{n}{k^{d-1}}.$$

This theorem shows that there exist systems $S$ such that, to solve the problem
ARR(S) by a decision tree, we must recognize the values of attributes from only
a small part of the set $A(S)$.

3 Auxiliary Statements

Let $S \in \mathcal{D}$ and $\alpha = \{a_{i_1} = \delta_1, \ldots, a_{i_m} = \delta_m\}$ be a consistent system of
equations such that $a_{i_1}, \ldots, a_{i_m} \in A(S)$ and
$\delta_1 \in V_S(a_{i_1}), \ldots, \delta_m \in V_S(a_{i_m})$. Define a decision rule system $S_\alpha$. Let $r$
be a decision rule from $S$ such that the system of equations $K(r) \cup \alpha$ is
consistent. Denote by $r_\alpha$ the decision rule obtained by removing from the left part
of $r$ all equations contained in $\alpha$. Then $S_\alpha$ is the set of all rules $r_\alpha$ such that
$r \in S$ and the system $K(r) \cup \alpha$ is consistent.

It is not difficult to prove the following two statements.

Lemma 1. Let $S \in \mathcal{D}$ and $\alpha = \{a_{i_1} = \delta_1, \ldots, a_{i_m} = \delta_m\}$ be a consistent
system of equations such that $a_{i_1}, \ldots, a_{i_m} \in A(S)$ and
$\delta_1 \in V_S(a_{i_1}), \ldots, \delta_m \in V_S(a_{i_m})$. Then $h(S) \ge h(S_\alpha)$.

Lemma 2. Let $S \in \mathcal{D}$ and $S'$ be a subsystem of $S$. Then $h(S) \ge h(S')$.

Now we consider some lower bounds on the value $h(S)$.

Lemma 3. Let $S$ be a decision rule system. Then $h(S) \ge d(S)$.

Proof. Let $r$ be a rule from $S$ such that the length of $r$ is equal to $d(S)$. Let
$\Gamma$ be a decision tree over $S$ which solves the problem ARR(S) and for which
$h(\Gamma) = h(S)$. It is clear that there exists a complete path $\xi$ of $\Gamma$ such that the
system $K(r) \cup K(\xi)$ is consistent. Since $\Gamma$ solves the problem ARR(S) we obtain
that $K(r) \subseteq K(\xi)$. Therefore $h(\xi) \ge d(S)$, $h(\Gamma) \ge d(S)$ and $h(S) \ge d(S)$.

Lemma 4. Let $S$ be a decision rule system such that $d(S) = 1$. Then
$h(S) \ge n(S)$.

Proof. Let $\Gamma$ be a decision tree over $S$ which solves the problem ARR(S) and
for which $h(\Gamma) = h(S)$. Evidently there exists a complete path $\xi$ of $\Gamma$ such that
the system $K(\xi)$ is consistent. It is clear that for each decision rule $r \in S$ either
$K(r) \subseteq K(\xi)$ or the system $K(r) \cup K(\xi)$ is inconsistent. Therefore
$A(\xi) \cap A(r) \ne \emptyset$ for any rule $r \in S$ such that $A(r) \ne \emptyset$. Taking into account
that $|A(r)| \le 1$ for any rule $r \in S$ and $|\bigcup_{r \in S} A(r)| = n(S)$ we obtain that
$|A(\xi)| = n(S)$. Therefore $h(\xi) \ge n(S)$, $h(\Gamma) \ge n(S)$ and $h(S) \ge n(S)$.

Lemma 5. Let $S$ be a decision rule system such that $k(S) = 1$. Then
$h(S) \ge n(S)$.

Proof. Let $\Gamma$ be a decision tree over $S$ which solves the problem ARR(S) and
for which $h(\Gamma) = h(S)$. Let $\delta = (\delta_1, \ldots, \delta_{n(S)}) \in V(S)$ and $\xi$ be a complete
path of $\Gamma$ such that $K(\xi) \subseteq K(S, \delta)$. Since $\Gamma$ solves the problem ARR(S) one
can show that the terminal node of this path is labeled with the set $S$ and
$K(r) \subseteq K(\xi)$ for any $r \in S$. Therefore $K(\xi) = K(S, \delta)$ and $h(\xi) \ge n(S)$.
Consequently $h(\Gamma) \ge n(S)$ and $h(S) \ge n(S)$.

4 Proofs of Theorems

Proof of Theorem 1. Consider the decision rule system $S = S_1 \cup S_2$ where
$S_1 = \{a_1 = 0 \wedge \ldots \wedge a_d = 0 \Rightarrow 0,\ a_{d+1} = 0 \Rightarrow d + 1,\ \ldots,\ a_n = 0 \Rightarrow n\}$ and
$S_2 = \{a_1 = 1 \Rightarrow n + 1, \ldots, a_1 = k - 1 \Rightarrow n + k - 1\}$. If $k = 1$ then $S_2 = \emptyset$.
It is clear that $n(S) = n$, $d(S) = d$ and $k(S) = k$.

Let $\Gamma$ be a decision tree over $S$ which solves the problem ARR(S) and for
which $h(\Gamma) = h(S)$. Let $\delta = (0, \ldots, 0) \in V(S)$ and $\xi$ be a complete path of $\Gamma$
such that $K(\xi) \subseteq K(S, \delta)$. Since $\Gamma$ solves the problem ARR(S) one can show
that the terminal node of this path is labeled with the set $S_1$ and $K(r) \subseteq K(\xi)$
for any $r \in S_1$. Therefore $K(\xi) = K(S, \delta)$ and $h(\xi) \ge n$. Consequently
$h(\Gamma) \ge n$ and $h(S) \ge n$. Hence $H(n, d, k) \ge n$. It is clear that $H(n, d, k) \le n$.



Proof of Theorem 2. Let $k = 1$ or $d = 1$. Using Lemmas 4 and 5 we obtain
$h(n, d, k) \ge n$. It is clear that $h(n, d, k) \le n$. Therefore $h(n, d, k) = n$.

Let $k \ge 2$ and $d \ge 2$.

At first we consider the lower bound on $h(n, d, k)$. Using Lemma 3 we obtain
that $h(n, d, k) \ge d$. We prove by induction on $d$ that for any $n, d, k \in \omega \setminus \{0\}$
such that $d \le n$ the following inequality holds:

$$h(n, d, k) \ge \frac{n(k - 1)}{k^d}. \qquad (1)$$

Since $h(n, 1, k) = n$ we have that (1) holds if $d = 1$. Suppose that for some
$d \ge 1$ the inequality $h(n, t, k) \ge n(k - 1)/k^t$ holds for any integer $t$,
$1 \le t \le d$. We prove that (1) holds for $d + 1$ as well. Let $S$ be a decision rule
system such that $n(S) = n$, $d(S) = d + 1$ and $k(S) = k$. Let $S'$ be a subsystem of
the system $S$ such that $A(S') = A(S)$ and $A(S'') \ne A(S)$ for any system
$S'' \subset S'$. It is clear that $n(S') = n$, $d(S') \le d + 1$ and $k(S') \le k$. Using Lemma 2
we obtain

$$h(S) \ge h(S'). \qquad (2)$$

Let $S' = \{r_1, \ldots, r_p\}$. It is clear that for $j = 1, \ldots, p$ the system of equations
$K(r_j)$ contains some equation $a_{i_j} = \sigma_j$ such that $a_{i_j} \notin A(S' \setminus \{r_j\})$. Therefore

$$|S'| \le n. \qquad (3)$$

Suppose |*S'| < WUTt Denote a = = <^ 1 , • • • , ciip = o~p}- It is clear that 

q; is a consistent system. Consider the system S'^ (see definition before Lemma 
1). Denote no = do — d[S'^) and ko — k[S'^). One can show that no > 

I < do < d and f < ko < k. Let do = 1 or /co = L Using Lemmas 4 and 5 
we obtain that h{S'^) > no > ^ > f • I^^t do > 2 and ko > 2. 

Using the inductive hypothesis we obtain that h{S'^) > f 

One can show that A Therefore h[S'^) > • Using Lemma 1 

we obtain that h[S') > h[S'^). From these relations and from (2) follows that 
h{S) > ppl.
Suppose now jS''! > Denote rn = n — jS''!. From ( 3 ) follows that

m > 0. 

Let m = 0 . One can show that in this case d[S') = 1 . Using Lemma 4 we 
obtain h[S') > n[S') = |S"| > . Using ( 2 ) we obtain h[S) > 

{k — l)n 
^d+l 

Now let m > 0 . Denote B = A(N) \ {a^^, . . . , It is clear that \B\ = rn. 
Let B = {a/^ , • • • , 04^ } and l\ < . . . < 1 ^. Let j E m}. Define a set Vj . If 

I Us/ (a/^. ) \ = k then Vj = Vs' (a/^. ) . If | Us' < k then Vj is a subset of the set lu 
possessing the following properties: \ Vj\ = k and Us'/(a/^. ) C Vj. Denote U = Ui x 
. . , X Vm . Let g = d(S") — 1 . It is clear that q <rn and q < d— 1 . One can show that 
for any decision rule r E S' there exist at least tuples 6 = ( 4 i, . . . , 6m) E V 

such that the system of equations K{r)U {ai^ = 4 i, . . . , = 5 m} is consistent. 

For each 6 = ( 4 i, . . . , 6m) € U let A^(^) be the number of decision rules r E S' 
such that the system K{r) U {a/^ = = 6m} is consistent. Denote 

N = clear that N > |N'| • ^ It is clear also 

that there exists a tuple 6' E V such that N {6') > ^ = -^ > ^ ^ 

Let 6' = ( 4 ^, . . . , 4 ^). Define a tuple 6 = ( 4 i, . . . , 4 ^). Let j E m}. If 

6'- E Vs'{ni.) then 6j = 6'-. If 6'- ^ Us'(ttq) then 6j is the minimal number from 

the set Vs'{ai.). One can show that N{6) > N{6') > Denote o; = {04^ = 

4 i, . . . , = 6m} and consider the system S'^. One can show that k[S'^) = 1 , 

d[S'c^) = 1 and n[S'^) > . Using Lemma 4 we obtain that h{S'^) > • 

Using Lemma 1 we obtain that h(S') > . From this inequality and from 

( 2 ) follows that h{S) > . Thus ( 1 ) holds for d + 1 and ( 1 ) is proved. 

Now we consider the upper bound on d(n, d, k). Define a decision rule system 

5 such that A(N) = {ai,...,a^}. Denote Ej^ = {0^1^ ... ^k — 1}. Consider a 

partition {ad^...^an} = B{6) such that |i^(d)| < for any 

6 E E^~^ (we suppose that B{6i) n B{62) = 0 for any 61^62 E 61 ^ ^2). 

Some sets in this partition are possibly empty but at least one of these sets is 
nonempty. Let 6 = (di, . . . ^64-1) E Define a decision rule system S(6). 

If B(6) = 0 then S(6) = {ai = di A . . . A a^-i = 64-1 => 0 }. If B(6) ^ 0 
then S{6) = {ai = 61 A ... A 0.4-1 = 64-1 A = 0 => 0 : G B{6)}. Denote 

S = S{6). It is clear that n{S) = n, k{S) = k and d{S) = d. 

Describe a decision tree E which solves the problem ARR(N). At first we 
compute values of the attributes ai, . . .^04-1. Let oi = di, . . . ^04-1 = 64-1. 
Denote d = (di, . . . , 64-1). If B{6) = 0 then S{6) is the solution of the problem 
ARR(N). Let B{6) U 0 ^^cid B{6) = {a^^, . . . , a^^}. Now we compute values 
of attributes aq,...,a^^. Let aq = = Om- Then the set {oi = 

di A . . . A 04_i = 64-1 A Oi- = 0 => 0 : d G m}, = 0 } is the solution 

of the problem ARR(N). It is not difficult to show that the decision tree E 
solves the problem ARR(N) and h[E) < d — 1 E \ < d + Therefore 
h{n,d,k) < d +

5 Conclusion

In the paper we obtained unimprovable upper and close to unimprovable lower
bounds on the minimal depth of a decision tree which solves the problem of
searching for all realizable rules from a decision rule system. These bounds allow
us to compare the efficiency of decision trees and decision rule systems. The lower
bound is nontrivial and has some independent theoretical interest.

Abstract. Decision trees are studied in rough set theory [6], [7] and test theory
[1], [2], [3] and are used in different areas of applications. The complexity of
constructing an optimal decision tree (a decision tree with minimal average depth)
is very high. In the paper some conditions reducing the search are formulated.
If these conditions are satisfied, an optimal decision tree for the problem is the
result of a simple transformation of optimal decision trees for some problems
obtained by decomposition of the initial problem. The decomposition properties
are used to show that the bounds given in [4] are unimprovable bounds on the
minimal average depth of a decision tree.



1 Basic Notions 

Let $A$ be a nonempty set, $F$ be some set of functions from $A$ to $\{0, 1\}$ and for
an arbitrary function $f \in F$ let the relation $f \not\equiv \mathrm{const}$ hold. Functions from
$F$ will be called attributes and the pair $U = (A, F)$ will be called an information
system.

A problem over the information system $U = (A, F)$ is any $(n + 1)$-tuple
$z = (\nu, f_1, \ldots, f_n)$ where $f_1, \ldots, f_n \in F$, $\nu : \{0, 1\}^n \to \omega$ and
$\omega = \{0, 1, 2, \ldots\}$. The problem $z$ may be interpreted as a problem of searching
for the value $z(a) = \nu(f_1(a), \ldots, f_n(a))$ for an arbitrary element $a \in A$. Different
problems of pattern recognition [3], fault diagnosis [1] and discrete optimization
[3] can be represented in such form.

Two elements $a$ and $b$ from $A$ will be called equivalent for the problem $z$ if
$f_i(a) = f_i(b)$ for $i = 1, \ldots, n$. This equivalence relation divides the set $A$ into
nonempty equivalence classes $A_1, \ldots, A_s$. Denote by $\Delta_z$ the set
$\{\delta_1, \ldots, \delta_s\} \subseteq \{0, 1\}^n$ where $\delta_i = (f_1(a_i), \ldots, f_n(a_i))$ and $a_i \in A_i$,
$i = 1, \ldots, s$. A problem $z$ will be called diagnostic if for any
$\delta_i, \delta_j \in \Delta_z$, $i \ne j$, the relation $\nu(\delta_i) \ne \nu(\delta_j)$ holds.

A probability distribution for the problem $z$ is a mapping $P : \Delta_z \to (0, 1]$ such
that $\sum_{\delta \in \Delta_z} P(\delta) = 1$. We interpret the number $P(\delta)$ as the probability of
the event $(f_1(a), \ldots, f_n(a)) = \delta$ for an element $a$ from $A$.

A decision tree for the problem $z$ is a finite oriented tree with the root
satisfying the following conditions:

a) each nonterminal vertex has assigned an attribute from $\{f_1, \ldots, f_n\}$ (i.e.
only those attributes are used in the decision tree which are listed in the
description of the problem $z$);

b) exactly two edges leave each nonterminal vertex, and these edges have
assigned numbers 0 and 1, respectively;

c) each terminal vertex has assigned a number from $\omega$.

A path from the root of the tree to a terminal vertex will be called complete.

Let $\Gamma$ be a decision tree for $z$ and $\xi$ be a complete path in $\Gamma$. Assume $\xi$
contains $t \ge 1$ nonterminal vertices $v_1, \ldots, v_t$, for $j = 1, \ldots, t$ the vertex $v_j$ has
assigned the attribute $f_{i_j}$, and the edge which leaves the vertex $v_j$ and enters
the vertex $v_{j+1}$ has assigned the number $\delta_j \in \{0, 1\}$. Then we will say that the
system of equations $\{f_{i_1}(x) = \delta_1, \ldots, f_{i_t}(x) = \delta_t\}$ corresponds to the path $\xi$.
The set of solutions in $A$ of the system of equations which corresponds to the
path $\xi$ will be denoted by $A(\xi)$. We assume $A(\xi) = A$ for a path $\xi$ which consists
of a terminal vertex only. The set of complete paths in $\Gamma$ will be denoted by
$\Xi(\Gamma)$. It can be shown that $\bigcup_{\xi \in \Xi(\Gamma)} A(\xi) = A$ and for any two different
complete paths $\xi_1, \xi_2$ the relation $A(\xi_1) \cap A(\xi_2) = \emptyset$ holds. Moreover, for an
arbitrary complete path $\xi$ in $\Gamma$ either $A_i \subseteq A(\xi)$ or $A_i \cap A(\xi) = \emptyset$ for
$i = 1, \ldots, s$.

We will say that the decision tree $\Gamma$ solves the problem $z$ if for $i = 1, \ldots, s$
the terminal vertex of the path $\xi$ such that $A_i \subseteq A(\xi)$ is assigned the number
$z(a_i)$, where $a_i$ is an element from $A_i$.

For $i = 1, \ldots, s$ we denote by $h_i$ the length of the path $\xi_i$ such that
$A_i \subseteq A(\xi_i)$. The value $\sum_{i=1}^{s} h_i P(\delta_i)$ will be called the average depth of the
decision tree $\Gamma$ relative to the probability distribution $P$ (or, in short, the
$P$-average depth of $\Gamma$). A decision tree $\Gamma$ for the problem $z$ which solves $z$ and has
minimal $P$-average depth will be called optimal for $z$ and $P$.
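The P-average depth can be computed directly from these definitions. The sketch below is illustrative only and uses an assumed encoding: a terminal vertex is an integer answer, a nonterminal vertex is a triple `(i, subtree_for_0, subtree_for_1)` with a 0-based attribute index `i`, and the distribution is a dictionary from tuples in $\Delta_z$ to probabilities. All names are hypothetical.

```python
def walk(tree, delta):
    """Follow the complete path selected by the tuple delta.
    Returns (number of nonterminal vertices visited, terminal label)."""
    depth = 0
    while isinstance(tree, tuple):
        i, low, high = tree
        tree = high if delta[i] else low
        depth += 1
    return depth, tree


def average_depth(tree, P):
    """P-average depth: sum of h_i * P(delta_i) over the tuples of Delta_z."""
    return sum(p * walk(tree, delta)[0] for delta, p in P.items())


def solves(tree, nu, P):
    """Check that every tuple of Delta_z is routed to the answer nu(delta)."""
    return all(walk(tree, delta)[1] == nu(delta) for delta in P)


# Two attributes; ask f_0 first, and ask f_1 only when f_0 = 1.
tree = (0, 0, (1, 1, 2))
P = {(0, 0): 0.5, (1, 0): 0.25, (1, 1): 0.25}


def nu(delta):
    return delta[0] + delta[1]


avg = average_depth(tree, P)
ok = solves(tree, nu, P)
```

Because half of the probability mass is resolved after one question and the rest after two, the average depth of this tree is 0.5·1 + 0.25·2 + 0.25·2 = 1.5.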



2 On Decomposition of Problem 

2.1 Auxiliary Notions 

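Let $U$ be an information system, $z = (\nu, f_1, \ldots, f_n)$ be a problem over $U$ and
$P$ be a probability distribution for $z$. For an arbitrary attribute
$f_i \in \{f_1, \ldots, f_n\}$ and an arbitrary number $\delta \in \{0, 1\}$ we put
$\Delta_z(f_i, \delta) = \{\bar\delta \in \Delta_z : \bar\delta = (\delta_1, \ldots, \delta_n), \delta_i = \delta\}$.

For each subset $T \subseteq \Delta_z$ we denote

$$N(T, P) = \sum_{\delta \in T} P(\delta).$$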

2.2 Proper Decomposition of the Problem 

Let U = (A, F) be an information system, zq = (i/Q, /i , • • • , /no) ^ diagnostic 
problem over U with rn equivalence classes Ai, . . . , Am and Az^ = (dj, . . . , d^) 
where df* = (/i . . . , (^i)) £^nd G A^, i = 1, . . . , m. For i = 1, . . . , m let
/L • • • 7 /n ) a problem over the information system F) with Si



equivalence classes and Az- 




. . , Let Fi be a probability distribution 


for the problem z^, i = 0 




Let t = C 


{0, l}’^o+---+nm wPere = 


(«. 


. -«^), e {0, = 0, . . . ,m, and 


r 




if = 0, 


A = 




if = < 


1 


(0, ...,0),iffcG to} \{i}, 




Define a 


function n : {0, l|’^o+---+nm ^ g^g follows : 


v{5) = 


II 


) , if ^ G P and S = 
iiS^ P. 


Consider the problem z = 




••VnoVt'" Vni •••<0 over U 



where 




/j(a), if a G Ai, 
0, if a ^ 



for j = 1, . . . , i = 1, . . . , m and a e A. One can show that Az = F, Define a 
probability distribution F for the problem z as follows: F[a'j) = Fo{6^)Fi{6j) for 
j = 1,.. .,Si ^iidi = 1, . . . ,m. The (m+l)-tuple {{zq.Fo), {zi, Fi), . . . , {zm.Fm)) 
will be called a proper decomposition of the pair (z, F) if the following conditions 
hold: 

1) Po{5°)N{AzXfj, 1),C) < \ _min Po{5) for j = 1, . . . , n^, i = 1, . . . , m; 

2) for any z, j G {1, . . . , m}, i and c G cj such that ^i{^) 

6eAz- ,Pi{6)=c 

> 0 and qj = ^j{^) F 0 the following inequalities min(g^,gj) <1/2 

,Pj(S)=c 

and max(g^,gj) < 1 hold. 

Let ((zo, Po), ( 2 ^ 1 , • • • , Pm)) be a proper decomposition of the pair 

(z, F) and for i = 0, . . . , m let be a decision tree for the problem z^, which 

solves z^ . For i = 1 , . . . , m assign each nonterminal vertex of the decision tree Fi 
instead of the attribute /j the attribute /j. Denote by Fi the obtained decision 
tree. For i = 1, . . . ,m find a complete path in Pq such that Ai C and 

change the terminal vertex of the path to the root of the decision tree Fi. 
Denote the obtained tree by i7(Po, Pi, ... , Pm)- 



2.3 Main Result 

[ Theorem l.]Let z he a problem over an information system U ^ F he a proha- 
hility distribution for z and ((zq, Po), (zi, Pi), . . . , (zm, Pm)) a proper decom- 
position of the pair (z, P). Let Fi he a decision tree for Zi solving the problem 
Zi and optimal for Zi and Fi^ i = 0, . . . , m. Then the tree i7(Po, Pi, . . . , Pm) i^ 0 . 
tree for the problem z^ which solves z and which is optimal for z and F , 

We omit the proof of Theorem 1 because it is too long. Below we consider an
application of this result.

3 Some Application of Decomposition

Theorem 1 allows us to construct optimal decision trees for some problems over
information systems. In this section we show that the result announced in [5],
namely that the upper bound from [4] on the minimal average depth of decision
trees is close to unimprovable, can be obtained as a consequence of Theorem 1.

3.1 Psirameters of Problems and Probability Distributions 

First we define the parameter $M(z)$ for a problem $z = (\nu, f_1, \ldots, f_n)$ over an
information system $U$. If $z(x) \equiv \mathrm{const}$ on $A$ then $M(z) = 0$. Let $z(x) \not\equiv \mathrm{const}$ on
$A$. For an arbitrary n-tuple $\delta = (\delta_1, \ldots, \delta_n) \in \{0, 1\}^n$ we denote by $M(z, \delta)$ the
minimal natural number $m$ such that there exist numbers $i_1, \ldots, i_m \in \{1, \ldots, n\}$
which satisfy the following condition: either the set of solutions on $A$ of the
system of equations $\{f_{i_1}(x) = \delta_{i_1}, \ldots, f_{i_m}(x) = \delta_{i_m}\}$ is empty or $z(x) \equiv \mathrm{const}$
on this set. Then

$$M(z) = \max_{\delta \in \{0,1\}^n} M(z, \delta).$$
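
The definition of $M(z)$ can be checked by brute force on very small systems. The following is a minimal sketch, not the authors' procedure: the names `m_of_tuple` and `m_of_problem` and the list-based encoding (`attrs[i][a]` is the value of the i-th attribute on object `a`, `z[a]` is the value of the problem on `a`) are assumptions made only for this illustration. The search is exponential in the number of attributes.

```python
from itertools import combinations, product


def m_of_tuple(attrs, z, delta):
    """Smallest number of equations f_i(x) = delta_i whose solution set on A
    is empty or carries a constant value of z (hypothetical helper)."""
    objects = range(len(z))
    for size in range(len(attrs) + 1):
        for idx in combinations(range(len(attrs)), size):
            sols = [a for a in objects
                    if all(attrs[i][a] == delta[i] for i in idx)]
            # An empty set or a single z-value both satisfy the condition.
            if len({z[a] for a in sols}) <= 1:
                return size
    raise ValueError("z is not determined by the attributes")


def m_of_problem(attrs, z):
    """M(z): the maximum of M(z, delta) over all binary tuples delta."""
    n = len(attrs)
    return max(m_of_tuple(attrs, z, d) for d in product((0, 1), repeat=n))


# Three objects, attribute f_i equals 1 exactly on object i,
# and z distinguishes all three objects.
attrs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
z = [0, 1, 2]
print(m_of_problem(attrs, z))
```

When `z` is constant the empty system already satisfies the condition, so the sketch returns 0, in agreement with the convention $M(z) = 0$.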

As a parameter of a probability distribution $P$ for the problem $z$ we will
consider the value $H(P)$,

which is called the entropy of the probability distribution $P$.

For a problem $z$ and a probability distribution $P$ for $z$ we denote by $h(z, P)$
the minimal $P$-average depth of a decision tree for $z$ solving the problem $z$.



3.2 Close to Unimprovable Upper Bound on Minimal Average 
Depth 

The following statement gives an upper bound on the minimal average depth of
a decision tree for an arbitrary problem over an information system.

Theorem 2. [4] Let $z$ be a problem over an information system $U$ and $P$ be a
probability distribution for $z$. Then

$$h(z, P) \le \begin{cases} M(z), & \text{if } M(z) \le 1, \\ M(z) + 2H(P), & \text{if } 2 \le M(z) \le 3, \\ M(z) + H(P), & \text{if } M(z) \ge 4. \end{cases}$$

The following statement characterizes the quality of the upper bound from Theorem 2.

Theorem 3. [5] For arbitrary natural numbers $m \ge 2$, $n$ there exist an information
system $U_m^n$, a problem $z_m^n$ over $U_m^n$ with $m^n$ equivalence classes, and the
probability distribution $P_m^n \equiv 1/m^n$ such that $H(P_m^n) = n \log_2 m$,

$$M(z_m^n) = \begin{cases} m - 1, & \text{if } n = 1, \\ m, & \text{if } n = 2, \\ m + 1, & \text{if } n \ge 3, \end{cases} \qquad \text{and} \qquad h(z_m^n, P_m^n) = \frac{(m + 2)(m - 1)}{2m}\, n.$$

3.3 Proof of Theorem 3

Let $m \ge 2$, $n$ be arbitrary natural numbers. First we describe the information
system $U_m^n$. Define the system of circles $B_m^n$ on the plane. By definition, $B_m^1$
consists of $m$ pairwise disjoint circles on the plane. Let the system $B_m^{n-1}$ have been
already defined. Then the system $B_m^n$ consists of $m$ pairwise disjoint circles, and
each of them contains a system of circles $B_m^{n-1}$. We will say that a circle from
$B_m^n$ is a circle of zero kind if it does not contain circles from $B_m^n$. Let the circles
of kinds from 0 to $i - 1$ have been defined, where $i < n$. We will say that a circle
from $B_m^n$ is a circle of $i$-th kind if the kind of each circle embedded in it
is not greater than $i - 1$, and at least one circle embedded in it is a circle
of $(i - 1)$-th kind. One can show that $B_m^n$ contains $s = m^n$ circles of zero kind.
Denote them $C_1, \ldots, C_s$. Denote by $a_i$ the set of points of the plane situated
inside the circle $C_i$, and denote $A = \{a_1, \ldots, a_s\}$. Put into correspondence to
each circle $C \in B_m^n$ the function $f : A \to \{0, 1\}$. The function $f$ takes the value
1 on the element $a_i$ if the set of points $a_i$ is situated inside the circle $C$, and
otherwise it takes the value 0. Denote by $F = \{f_1, \ldots, f_t\}$ the set of functions
which correspond to all circles from $B_m^n$. Then $U_m^n = (A, F)$.
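
The nested construction can be mirrored combinatorially, which makes the counts easy to check. The sketch below is an illustration only, under the assumption that each circle may be identified with the set of zero-kind circles lying inside it; the function name `build_circle_system` and the block-based indexing are hypothetical and not taken from the paper.

```python
def build_circle_system(m, n):
    """Return (objects, circles, attributes) for the nested system of circles.

    objects    -- indices 0 .. m**n - 1 of the zero-kind circles (the set A)
    circles    -- list of (kind, frozenset of objects inside the circle)
    attributes -- attributes[j][a] is 1 if object a lies inside circle j
    """
    s = m ** n
    objects = list(range(s))
    circles = []
    for kind in range(n):
        width = m ** kind  # a circle of this kind surrounds m**kind objects
        for start in range(0, s, width):
            circles.append((kind, frozenset(range(start, start + width))))
    attributes = [[1 if a in inside else 0 for a in objects]
                  for _, inside in circles]
    return objects, circles, attributes


objects, circles, attributes = build_circle_system(2, 3)
print(len(objects), len(circles))
```

For m = 2 and n = 3 this gives 8 objects and 2 + 4 + 8 = 14 circles, and every object lies inside exactly n circles, one of each kind.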

Let $z_m^n = (\nu, f_1, \ldots, f_t)$ be a diagnostic problem over $U_m^n$. The following
statement gives the value of the parameter $M(z_m^n)$ for the problem $z_m^n$.

Lemma 1. Let $m \ge 2$, $n$ be arbitrary natural numbers. Then

$$M(z_m^n) = \begin{cases} m - 1, & \text{if } n = 1, \\ m, & \text{if } n = 2, \\ m + 1, & \text{if } n \ge 3. \end{cases}$$

Proof. Consider the case $n \ge 3$. Let us estimate the parameter $M(z_m^n, \delta)$ for
an arbitrary $\delta \in \{0, 1\}^t$. Let $\delta = (0, \ldots, 0)$. The system of equations
$\{f_{i_1}(x) = 0, \ldots, f_{i_m}(x) = 0\}$ does not have solutions on the set $A$ if $f_{i_1}, \ldots, f_{i_m}$ are
pairwise different attributes from $F$ which correspond to circles of the $(n - 1)$-th
kind. Therefore $M(z_m^n, \delta) \le m$. Let $\delta \ne (0, \ldots, 0)$. Denote by $C_0$ the circle of the
least kind such that in $\delta$ the value of the attribute corresponding to $C_0$ is equal
to 1. Let $C_0$ be a circle of $n_0$-th kind and $f_{i_0}$ be the attribute corresponding to
$C_0$. If $n_0 = 0$ then the system of equations $\{f_{i_0}(x) = 1\}$ has only one solution in the
set $A$ and $M(z_m^n, \delta) = 1$. Let $n_0 \ge 1$. Denote by $f_{i_1}, \ldots, f_{i_m}$ the attributes
corresponding to the circles of $(n_0 - 1)$-th kind which are embedded in $C_0$. By the choice
of the circle $C_0$, in $\delta$ the values of the attributes $f_{i_1}, \ldots, f_{i_m}$ are equal to 0. The system
of equations $\{f_{i_0}(x) = 1, f_{i_1}(x) = 0, \ldots, f_{i_m}(x) = 0\}$ does not have solutions in
the set $A$ and $M(z_m^n, \delta) \le m + 1$. Therefore

$$M(z_m^n) \le m + 1. \qquad (1)$$

We will show that the value $m + 1$ is attained on the t-tuple $\delta = (\delta_1, \ldots, \delta_t)$
such that $n_0 = 2$, the values of the attributes corresponding to circles which
include the circle $C_0$ are equal to 1, and the values of the other attributes are equal to
0. Let $S = \{f_{j_1}(x) = \delta_{j_1}, \ldots, f_{j_k}(x) = \delta_{j_k}\}$ be an arbitrary system of equations
such that the set of solutions of this system in $A$ is empty or $z_m^n(x) \equiv \mathrm{const}$ on
this set. Let $l \in \{1, \ldots, k\}$. Change the equation $f_{j_l}(x) = \delta_{j_l}$ to the equation
$f_{i_r}(x) = 0$ if the circle corresponding to the attribute $f_{j_l}$ is embedded into
the circle corresponding to the attribute $f_{i_r}$ or is equal to this circle for some
$r \in \{1, \ldots, m\}$. Otherwise, change the equation $f_{j_l}(x) = \delta_{j_l}$ to the equation
$f_{i_0}(x) = 1$. Make these changes for $l = 1, \ldots, k$ and denote the obtained system
by $S_1$. It is clear that $S_1$ has at most $k$ equations, the set of solutions of $S_1$ in
$A$ is a subset of the set of solutions of $S$ in $A$, and $S_1$ is a subsystem of the
system $S_2 = \{f_{i_0}(x) = 1, f_{i_1}(x) = 0, \ldots, f_{i_m}(x) = 0\}$. Suppose that $S_1 \ne S_2$.

One can show that in this case $z_m^n(x) \not\equiv \mathrm{const}$ on the set of solutions of $S_1$ in
$A$. But this is impossible. Therefore $k \ge m + 1$ and $M(z_m^n, \delta) \ge m + 1$. From this
inequality and from (1) it follows that $M(z_m^n) = m + 1$. The cases $n = 1$ and $n = 2$
are considered similarly. □

Proof of Theorem 3. We will proceed by induction on $n$. Let $n = 1$. Define
the decision tree $\Gamma$ for the problem $z_m^1$. The decision tree $\Gamma$ contains $m - 1$
nonterminal vertices $v_1, \ldots, v_{m-1}$, which are assigned pairwise different attributes
$f_1, \ldots, f_{m-1}$ respectively, and $m$ terminal vertices $v_m, w_1, \ldots, w_{m-1}$. For
$i = 1, \ldots, m - 1$ two edges leave the vertex $v_i$, and they are assigned the numbers 0 and
1 respectively. The edge which is assigned the number 0 enters the vertex
$v_{i+1}$, and the edge which is assigned the number 1 enters the vertex $w_i$.
For $i = 1, \ldots, m - 1$ the vertex $w_i$ is assigned the number $z_m^1(a_i)$, where $a_i$ is
the element from the set $A$ such that $f_i(a_i) = 1$. The vertex $v_m$ is assigned the
number $z_m^1(a_0)$, where $a_0$ is the element from the set $A$ such that $f_i(a_0) = 0$
for $i = 1, \ldots, m - 1$. The decision tree $\Gamma$ does not contain other vertices and
edges. One can show that the decision tree $\Gamma$ solves the problem $z_m^1$ and $\Gamma$
is optimal for $z_m^1$ and $P_m^1$. Let us evaluate the average depth of $\Gamma$:

$$h(\Gamma, P_m^1) = \sum_{i=1}^{m-1} i \cdot \frac{1}{m} + (m - 1) \cdot \frac{1}{m} = \frac{(m + 2)(m - 1)}{2m}.$$

Then $h(z_m^1, P_m^1) = \frac{(m + 2)(m - 1)}{2m}$.
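
The arithmetic of the base case can be checked with exact fractions. The sketch below assumes the chain-shaped tree described above (terminal vertices at depths 1, ..., m - 1 and one more at depth m - 1, each reached with probability 1/m); the function names are hypothetical. The second function adds the depth of the top-level tree to the depth of one subtree, which is the recursion the inductive step relies on under the uniform distribution.

```python
from fractions import Fraction


def chain_tree_average_depth(m):
    """Average depth of the chain tree: w_i sits at depth i, v_m at depth m - 1."""
    depths = list(range(1, m)) + [m - 1]
    return sum(Fraction(d, m) for d in depths)


def nested_average_depth(m, n):
    """Depth of the top-level chain plus the depth inside one of m equal subtrees."""
    if n == 1:
        return chain_tree_average_depth(m)
    return chain_tree_average_depth(m) + nested_average_depth(m, n - 1)


def closed_form(m, n):
    """The value (m + 2)(m - 1) n / (2m) stated in Theorem 3."""
    return Fraction((m + 2) * (m - 1), 2 * m) * n


print(chain_tree_average_depth(4), closed_form(4, 1))
```

Both expressions agree for every small m and n tried in the test below, which is only a sanity check of the arithmetic and not a proof of optimality.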

Consider now a natural number $n \ge 2$ such that the considered statement
is true for any natural number less than $n$. Consider a decomposition
$((z_0, P_0), (z_1, P_1), \ldots, (z_m, P_m))$ of the pair $(z_m^n, P_m^n)$. The diagnostic problem
$z_0$ contains attributes corresponding to all circles of $(n - 1)$-th kind from the
system $B_m^n$. Each of the diagnostic problems $z_1, \ldots, z_m$ contains attributes
corresponding to all circles from one of the systems $B_m^{n-1}$ which is contained in
$B_m^n$, and these systems are pairwise different. The problem $z_i$ is defined such
that $z_i(x)$ is a restriction of the mapping $z : A \to \omega$ to the set $A_i$, $i = 1, \ldots, m$.
For $i = 0, \ldots, m$ let $P_i$ be a uniform probability distribution for the problem
$z_i$. One can show that $((z_0, P_0), \ldots, (z_m, P_m))$ is a proper decomposition
of the pair $(z_m^n, P_m^n)$. One can show that $h(z_0, P_0) = h(z_m^1, P_m^1)$ and $h(z_i, P_i) =
h(z_m^{n-1}, P_m^{n-1})$ for $i = 1, \ldots, m$. Using the inductive hypothesis we obtain that
$h(z_0, P_0) = \frac{(m+2)(m-1)}{2m}$ and $h(z_i, P_i) = \frac{(m+2)(m-1)}{2m}(n - 1)$ for $i = 1, \ldots, m$. Let
$\Gamma_i$ be a decision tree for the problem $z_i$ which solves $z_i$ and which is optimal
for $z_i$ and $P_i$, $i = 0, \ldots, m$. Let $\Omega = \Omega(\Gamma_0, \Gamma_1, \ldots, \Gamma_m)$. From the definition of
the tree $\Omega$ it follows that $h(\Omega, P_m^n) = h(\Gamma_0, P_0) + \sum_{i=1}^{m} \frac{1}{m} h(\Gamma_i, P_i) = \frac{(m+2)(m-1)}{2m}\, n$.
Using Theorem 1 we obtain $h(z_m^n, P_m^n) = \frac{(m+2)(m-1)}{2m}\, n$. □







References 

1. Chegis, I., Yablonskii, S.: Logical methods for electric circuit control. Trudy MIAN
SSSR 51 (1958) 270-360 (in Russian).

2. Moshkov, M.: Conditional tests. Problemy Kibernetiki 40 (1983) 131-170 (in Russian).

3. Moshkov, M.: Decision Trees. Theory and Applications. Nizhni Novgorod University 
Publishers, Nizhni Novgorod (1994) (in Russian). 

4. Moshkov, M., Chikalov, I.: Bounds on average weighted depth of decision trees.
Fundamenta Informaticae 31 (1997) 145-157

5. Moshkov, M., Chikalov, I.: Bounds on average depth of decision trees. Proceedings of
the Fifth European Congress on Intelligent Techniques and Soft Computing, Aachen
(1997) 226-230

6. Pawlak, Z.: Rough Sets - Theoretical Aspects of Reasoning about Data. Kluwer 
Academic Publishers, Dordrecht (1991) 

7. Skowron, A., Rauszer, C.: The discernibility matrices and functions in information 
systems. Intelligent Decision Support. Handbook of Applications and Advances of 
the Rough Set Theory. Kluwer Academic Publishers, Dordrecht (1992) 331-362 




On Diagnosis of Retaining Faults in Circuits 



Albina Moshkova

Faculty of Calculating Mathematics and 
Cybernetics of Nizhni Novgorod State University 
23, Gagarina Av., Nizhni Novgorod, 603600, Russia 



Abstract. Diagnosis of faults in circuits is an important field of application
of rough set theory and test theory. The problem of finding an
optimal circuit basis for an arbitrary closed class of Boolean functions is
considered. The basis should be optimal in the sense of simplicity of diagnosis
of so-called retaining faults in iteration-free circuits over the basis.
This problem is solved for all closed classes. The paper studies the complexity
of diagnosis of retaining faults in iteration-free circuits over optimal
bases.



1 Introduction 

The structure of the system of all classes of Boolean functions closed over substi- 
tution operation has been described by Post in [7] and [8]. Yablonskii, Gavrilov 
and Kudriavtzev in [10] studied the structure (slightly different from that of 
Post) of all classes of Boolean functions closed over substitution operation with 
the assumption that if a Boolean function is given then all functions which differ 
from this function by unessential variables are given. The latter structure will 
be used in the paper. 

We will consider the realization of functions from closed classes by combi- 
natorial circuits and we will study the depth of decision trees for diagnosis of 
so-called retaining faults in these circuits. The faults under consideration con- 
sist of the change of the function realized by a gate such that there exist two 
tuples on which values of the function are invariable and are equal to 0 and 1 
respectively (if a gate realizes a constant then this gate has no faults). 

The problem of diagnosis of arbitrary circuits is complicated.
Therefore a nonstandard approach to the design and diagnosis of circuits is
considered [4, 5]. Only formula-like circuits over specially chosen bases are used. In
addition to the usual work mode of the circuit there exists a diagnostic mode
in which the circuit is transformed into a so-called iteration-free circuit, for which the
diagnosis problem can be solved efficiently.

The main problem considered in the paper is to find, for an arbitrary closed
class of Boolean functions, a circuit basis for this class (an optimal basis) for which
the minimal depth of decision trees for diagnosis of iteration-free circuits over this
basis grows most slowly with the growth of the number of gates in circuits. Optimal
bases are found for all closed classes of Boolean functions. The paper studies the
complexity of diagnosis of retaining faults in iteration-free circuits over optimal
bases. Note that the analogous problem for constant faults on gate inputs
was solved in [5] for 40 closed classes.

In proofs we use methods and results of test theory [1, 2, 3] and rough set 
theory [6, 9]. 

2 On Diagnosis of Iteration-Free Circuits

Let $f(x_1, \ldots, x_n)$ be a Boolean function. The variable $x_i$ of the function $f$ will
be called essential if there exist two n-tuples $\delta$ and $\sigma$ from $\{0, 1\}^n$ which differ
only in the $i$-th digit and such that $f(\delta) \ne f(\sigma)$. The variables of the function $f$
which are not essential will be called unessential.

A circuit basis is a finite nonempty set $B$ of Boolean functions for each of
which all variables are essential. Divide the basis $B$ into two parts: $B = B_C \cup B_F$,
where $B_C = B \cap \{0, 1\}$ and $B_F = B \setminus \{0, 1\}$. Let $S$ be a combinatorial circuit
over the basis $B$. We will assume that circuit gates can have the following faults,
which will be called retaining faults. Let a gate realize a constant from $B_C$.
Then this gate has no faults. Let a gate realize a function $f(x_1, \ldots, x_n) \in B_F$.
Then there exist two n-tuples $\alpha_f^0$ and $\alpha_f^1$ from $\{0, 1\}^n$ such that $f(\alpha_f^0) = 0$,
$f(\alpha_f^1) = 1$, and the gate with a fault can realize an arbitrary Boolean function
$f'(x_1, \ldots, x_n)$ such that $f'(\alpha_f^0) = 0$ and $f'(\alpha_f^1) = 1$.

For example, let a gate realize the function $f(x, y) = x \vee y$, $\alpha_f^0 = (0, 0)$
and $\alpha_f^1 = (1, 1)$. Then the considered gate with a fault can realize an arbitrary
Boolean function from the set $\{x \vee y,\ x \cdot y,\ x,\ y\}$.
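
The example can be reproduced by enumerating truth tables. The following sketch is an illustration under the stated definition: it fixes the value 0 on one tuple and the value 1 on another and lets all remaining tuples vary freely. The function name `retaining_faults` and the dictionary encoding of truth tables are assumptions made for this sketch.

```python
from itertools import product


def retaining_faults(n, alpha0, alpha1):
    """All truth tables on {0,1}^n that are 0 on alpha0 and 1 on alpha1."""
    free = [t for t in product((0, 1), repeat=n) if t not in (alpha0, alpha1)]
    tables = []
    for values in product((0, 1), repeat=len(free)):
        table = dict(zip(free, values))
        table[alpha0] = 0  # value retained on the first distinguished tuple
        table[alpha1] = 1  # value retained on the second distinguished tuple
        tables.append(table)
    return tables


def table_of(func, n=2):
    """Truth table of a Python function, in the same dictionary encoding."""
    return {t: func(*t) for t in product((0, 1), repeat=n)}


faults = retaining_faults(2, (0, 0), (1, 1))
print(len(faults))
```

With two variables there are two free tuples, (0, 1) and (1, 0), so exactly four functions remain: disjunction, conjunction, and the two projections.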

The diagnosis problem for a circuit consists in recognizing the function
realized by the circuit, which possibly has retaining faults of gates. To solve
this problem we use decision trees. Each check of a decision tree consists
in observing the output of the circuit when a binary tuple is supplied at its
inputs. As a complexity measure we consider the depth of a decision tree (the
maximal number of checks which this tree performs).

The number of gates in the circuit $S$ will be denoted by $L(S)$, and the minimal
depth of a decision tree which solves the diagnosis problem for the circuit $S$ will
be denoted by $h(S)$.

A combinatorial circuit will be called iteration-free if each node (input or 
gate) of it has at most one issuing edge. It is not difficult to prove the following 
statement. 

Proposition 1. Let $B$ be a circuit basis for which $B_F \ne \emptyset$. Then there exists a
constant $c$ such that $h(S) \le c \cdot L(S)$ for any iteration-free circuit $S$ over $B_F$.

We denote by $C(B)$ the minimal natural number $c$ for which the inequality
$h(S) \le c \cdot L(S)$ holds for any iteration-free circuit $S$ over $B_F$. If $B_F = \emptyset$ then let
$C(B) = 0$. For a Boolean function $f$ we denote by $p(f)$ the number of essential
variables of the function $f$. For a nonempty finite set of Boolean functions $D$ let
$p(D) = \max\{p(f) : f \in D\}$. The following statement describes the behavior of
the parameter $C(B)$.



Theorem 2. Let $B$ be a circuit basis. Then

$$C(B) = \begin{cases} 0, & \text{if } p(B) = 0, \\ 2^{p(B)} - 2, & \text{if } p(B) > 0. \end{cases}$$



3 On Optimal Bases 

We will say that two Boolean functions are equal if they differ only in unessential
variables. As in [10], we will assume that if a Boolean function $f$ is given, then
all functions which are equal to $f$ are given. Let $U$ be a nonempty set of Boolean
functions. The closure of the set $U$ under the operation of substitution will be denoted
by $[U]$. The set $U$ will be called a closed class if $U = [U]$.

Let $U$ be a closed class of Boolean functions and $B$ be a circuit basis. We
will say that $B$ correctly generates the class $U$ if $U = [B_F] \cup B_C$.

Let $B$ correctly generate the class $U$. This basis may be used for constructing
circuits which realize functions from $U$ and have effective algorithms for diagnosis
of retaining faults of gates. Let $f \in U$.

Assume that $f \in [B_F]$. Let $\varphi$ be a formula over $B_F$ realizing the function
$f$. Then a circuit $S$ over $B_F$ is constructed according to the formula $\varphi$ so that it
satisfies the following conditions:

a) the circuit $S$ realizes the function $f$;

b) $L(S) = L(\varphi)$, where $L(\varphi)$ is the number of functional symbols in the
formula $\varphi$;

c) at most one edge issues from any gate of the circuit $S$.

In addition to the usual work mode of the circuit $S$ there exists the diagnostic
mode, in which the inputs of the circuit $S$ are "split" so that it becomes an
iteration-free circuit $\tilde{S}$. The relations $h(\tilde{S}) \le C(B) \cdot L(\tilde{S}) = C(B) \cdot L(\varphi)$ hold for
the circuit $\tilde{S}$.

Assume now that $f$ is a constant from $B_C$. Then we can realize the function
$f$ by a circuit $S$ which has one gate and for which $h(S) = 0$.

One can prove the following statement. 

Proposition 3. For any closed class U of Boolean functions there exists a cir- 
cuit basis B which correctly generates the class U . 

Denote $C(U) = \min C(B)$, where $B$ ranges over circuit bases which correctly generate
the class $U$. A circuit basis $B$ will be called an optimal circuit basis for the class
$U$ if $B$ correctly generates the class $U$ and $C(B) = C(U)$.

A nonempty finite set of Boolean functions $D$ will be called a basis of the
class $U$ if $[D] = U$ and for any set $D' \subset D$ the relation $[D'] \ne U$ holds. Denote
$p(U) = \min p(D)$, where $D$ ranges over bases of the class $U$. For each closed class $U$ the
value $p(U)$ can be found in [10].

Optimal circuit bases were constructed for each closed class U . The following 
theorem describes the behavior of the parameter C{U). 







Theorem 4. Let $U$ be a closed class of Boolean functions. Then

$$C(U) = \begin{cases} 0, & \text{if } p(U) = 0, \\ 2^{p(U)} - 2, & \text{if } p(U) > 0. \end{cases}$$

Below we consider examples of some closed classes of Boolean functions.

1. The class $C_1$ comprises all Boolean functions. For this class $C(C_1) = 2$ and
$\{x \vee y\}$ is an optimal circuit basis.

2. The class $L_1$ comprises all linear functions. For this class $C(L_1) = 2$ and
$\{x + y,\ x + y + 1\}$ is an optimal circuit basis.

3. The class $D_3$ comprises all self-dual functions. For this class $C(D_3) = 6$ and
$\{x \cdot y \vee x \cdot z \vee y \cdot z\}$ is an optimal circuit basis.

4. The class $A_1$ comprises all monotone functions. For this class $C(A_1) = 2$ and
$\{x \cdot y,\ x \vee y,\ 0,\ 1\}$ is an optimal circuit basis.

Introduction

Diagnosis of faults in circuits is a significant area of application for test theory
[1, 3, 4, 5, 6] and rough set theory [2]. The circuits considered in the paper realize
Boolean functions and use gates each of which realizes a function from some finite
set of Boolean functions (a basis for circuits). We will consider the class of so-called
nonelementary faults of these circuits. This class contains different combinations
of $\wedge$- and $\vee$-types of faults, constant faults and "negation"-type faults. We
will consider the problem of constructing decision trees of minimal depth
which recognize the function realized by a given circuit that possibly has faults.
We will consider not the set of all possible faults of a circuit, but some subsets
containing at most $k \ge 2$ functionally distinguishable faults. In the circuits, faults
are represented by gates which realize functions from some finite set of Boolean
functions (a basis of faults). We will consider nonelementary bases of faults. A
nonelementary basis of faults contains a function which is neither a conjunction nor a
constant, a function which is neither a disjunction nor a constant, and a function which is
not linear.

It is shown that for an arbitrary nonelementary basis of faults for any circuit 
in an arbitrary finite basis with at least 2 gates and n > 4 inputs, and for any 
fc, 2 < ^ ^ ^ functionally distinguishable faults of 

circuit for diagnosing of which it is necessary to use decision trees which depth 
is at least — 1 . 



1 Definitions and Notation 

A circuit is a finite oriented contour-free (that is, acyclic) graph in which

1) each vertex without incoming arcs is assigned the constant 0 or 1, or
a variable from the alphabet $X = \{x_1, x_2, \ldots\}$; the variables assigned to
different vertices are different;

2) each vertex which has $r \ge 1$ incoming arcs is assigned some
Boolean function depending essentially on $r$ variables;

3) all the arcs are assigned numbers such that if some vertex has $r$ incoming
arcs then these arcs are assigned numbers from 1 through $r$;

4) exactly one vertex is labeled by *.

The vertices which are labeled by variables are called circuit inputs; all
other vertices are called gates; and the vertex labeled by * is called the output
of the circuit.

We assume that a) the circuit contains at least one input and one gate; and
b) a gate is taken as the circuit output.

We will say that in the circuit a vertex $v_1$ precedes a vertex $v_2$ if there
exists an oriented path from the vertex $v_1$ to the vertex $v_2$. We will
say that the vertex $v_1$ immediately precedes the vertex $v_2$ if $v_1$ and $v_2$ are
connected by the arc $(v_1, v_2)$.

To determine a function realizable by the circuit, each vertex of the circuit 
can be associated with a function in the following way: 

1) the vertex which is labeled by a variable (constant) will be associated with 
the function being equal to this variable (constant); 

2) let $v_0$ be a vertex which is labeled by a Boolean function $\varphi(y_1, \ldots, y_r)$,
and $v_1, \ldots, v_r$ be the vertices immediately preceding $v_0$. Let the arc $(v_i, v_0)$ be
labeled by the number $i$, $i = 1, \ldots, r$, and let the vertices $v_1, \ldots, v_r$ be already
associated with the functions $g_1, \ldots, g_r$ respectively. Then let the vertex $v_0$ be
associated with the function $\varphi(g_1, \ldots, g_r)$.

The circuit will realize the function associated with the vertex labeled by *. 
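
The inductive association of functions with vertices amounts to evaluating the graph from the inputs towards the output. The sketch below illustrates this under an assumed encoding that is not part of the paper: a circuit is a dictionary mapping each vertex name to a triple `(kind, payload, predecessors)`, where the order of the predecessor list plays the role of the arc numbers.

```python
from itertools import product


def evaluate(circuit, output, assignment):
    """Value realized at `output` for a given assignment of the input variables."""
    cache = {}

    def value(v):
        if v not in cache:
            kind, payload, preds = circuit[v]
            if kind == "input":
                cache[v] = assignment[payload]   # vertex labeled by a variable
            elif kind == "const":
                cache[v] = payload               # vertex labeled by 0 or 1
            else:
                # Gate: apply its function to the values of the predecessors,
                # taken in the order given by the arc numbers.
                cache[v] = payload(*(value(p) for p in preds))
        return cache[v]

    return value(output)


# Hypothetical circuit realizing (x1 AND x2) OR x3; "g2" is the output vertex.
circuit = {
    "x1": ("input", "x1", []),
    "x2": ("input", "x2", []),
    "x3": ("input", "x3", []),
    "g1": ("gate", lambda a, b: a & b, ["x1", "x2"]),
    "g2": ("gate", lambda a, b: a | b, ["g1", "x3"]),
}
print(evaluate(circuit, "g2", {"x1": 1, "x2": 1, "x3": 0}))
```

The cache ensures that a vertex with several outgoing arcs is evaluated once, which matters for circuits that are not iteration-free.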

An arbitrary finite nonempty set of Boolean functions $P$ containing a function
which is neither a conjunction nor a constant, a function which is neither a disjunction
nor a constant, and a function which is not linear will be called a nonelementary
basis of faults or simply a basis of faults.

Let $\xi(\bar{x}_t)$ be some Boolean function from $P$, $\bar{x}_t = (x_1, \ldots, x_t)$. Now let us
define the operation of introducing a $\xi$-fault into the circuit $S$.

1) Let $t > 0$. Add to $S$ the vertex $e$, assign to it the function $\xi$, and then
draw the arcs $(v_1, e), \ldots, (v_t, e)$, where $v_1, \ldots, v_t = \bar{v}$ is an arbitrary sequence of
vertices in $S$. Then assign the number $i$, $i = 1, \ldots, t$, to the arc $(v_i, e)$.
The obtained circuit will be denoted by $S'$; if $\bar{v}$ contains the output of
$S$, then the vertex $e$ will be labeled as the output of $S'$. Now, in $S'$ take
an arc $(v_i, w)$ such that $v_i$ is a vertex from $\bar{v}$, the vertex $w$ differs from $e$,
$w$ is not contained in $\bar{v}$, and $w$ does not precede any vertex from $\bar{v}$. In $S'$,
draw the arc $(e, w)$, eliminate the arc $(v_i, w)$, and assign its number to the
arc $(e, w)$. The obtained circuit will be denoted by $S''$. We will say that the circuits
$S'$ and $S''$ have been obtained from the circuit $S$ by
introducing the $\xi$-fault.

2) Let $t = 0$. In this case $\xi$ is the constant 0 or 1. Add the vertex
$e$ to the circuit $S$, assign to it the function $\xi$, and mark it as the circuit output.
The obtained circuit will be denoted by $S'$. Further, arbitrarily choose an
arc $(v_1, v_2)$ in $S$, add the vertex $e$ to $S$, assign to it the function $\xi$, draw the arc
$(e, v_2)$, assign to it the number of the arc $(v_1, v_2)$, and eliminate the arc $(v_1, v_2)$.
The obtained circuit will be denoted by $S''$. We will say that the circuits $S'$ and $S''$
have been obtained from the circuit $S$ by introducing
the $\xi$-fault.

Now define the set of circuits $H_P(S)$ in the following way:

1) $S \in H_P(S)$;

2) let $S' \in H_P(S)$ and $\xi \in P$; then all the circuits which can be obtained
from $S'$ by introducing the $\xi$-fault belong to $H_P(S)$ as well;

3) the set $H_P(S)$ contains no other circuits.

Let $F$ be some set of Boolean functions. By $[F]$ we will denote the set
of all superpositions of functions from $F$. Note that the set $[P]$ is a closed
class of Boolean functions containing a function which is neither a conjunction nor a
constant, a function which is neither a disjunction nor a constant, and a function which is
not linear.



1.1 Problem of Diagnosing 

Divide the set of circuits $H_P(S)$ into subsets $H_P^1(S), H_P^2(S), \ldots, H_P^N(S)$ such
that all circuits from the same subset realize the same Boolean function and
circuits from different subsets realize different functions. Let $i_1, \ldots, i_k \in \{1, \ldots, N\}$.

Now we define the problem of diagnosing $S$ with respect to the subsets $H_P^{i_1}(S),
\ldots, H_P^{i_k}(S)$. For an arbitrary circuit $S' \in \bigcup_{j=1}^{k} H_P^{i_j}(S)$ it is required to find the
subset which contains the circuit $S'$.

For solving this problem we will use decision trees.

The set of all Boolean functions realizable by circuits from $\bigcup_{j=1}^{k} H_P^{i_j}(S)$ will
be denoted by $F_P^{i_1, \ldots, i_k}(S)$ ($F_P(S) = F_P^{1, \ldots, N}(S)$ is the set of Boolean functions
realizable by circuits from $H_P(S)$). A circuit $S$ with $n \ge 1$ inputs will be denoted
by $S_n$ if convenient.

A decision tree $Y$ for the circuit $S_n$ with respect to the subsets $H_P^{i_1}(S_n), \ldots,
H_P^{i_k}(S_n)$ is a finite oriented rooted tree where each nonterminal vertex is labeled
by a tuple from $\{0, 1\}^n$, and each terminal vertex is labeled by some number
from the set $\{i_1, \ldots, i_k\}$. From each nonterminal vertex there run out exactly
two arcs, which are labeled by the numbers 0 and 1 respectively. For any function
$f_{i_j} \in F_P^{i_1, \ldots, i_k}(S_n)$ which is realized by a circuit from $H_P^{i_j}(S_n)$ there exists some
complete path $\gamma = v_1, u_1, \ldots, v_r, u_r, v_{r+1}$ (from the root to a terminal vertex)
such that the vertex $v_{r+1}$ is labeled by the number $i_j$; and if for $q = 1, \ldots, r$ the
vertex $v_q$ is labeled by the tuple $\alpha_q \in \{0, 1\}^n$ and the arc $u_q$ is labeled by the
number $\delta_q \in \{0, 1\}$, then the function $f_{i_j}$ is the single function from $F_P^{i_1, \ldots, i_k}(S_n)$
which on the tuples $\alpha_1, \ldots, \alpha_r$ has the values $\delta_1, \ldots, \delta_r$ respectively.

The maximal length of a complete path is called the depth of the decision tree
$Y$ and is denoted by $h(Y)$.

Let $h_P^{i_1, \ldots, i_k}(S_n) = \min h(Y)$, where the minimum is taken over all decision
trees for $S_n$ with respect to the subsets $H_P^{i_1}(S_n), \ldots, H_P^{i_k}(S_n)$.

In this paper we study the value $h_P^k(S_n) = \max h_P^{i_1, \ldots, i_k}(S_n)$, where the maximum
is taken over all tuples from the set $\{(i_1, \ldots, i_k) : 1 \le i_1 < \cdots < i_k \le N\}$.

2 Results

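Let

$$\lambda_k^n = \min\{\lambda_n, k - 1\}, \qquad \mu_k^n = \min\{2^n, k - 1\},$$

where $n \ge 1$, $k \ge 2$, and

$$\lambda_n = \binom{n}{\lfloor n/2 \rfloor} + \binom{n}{\lfloor n/2 \rfloor + 1}.$$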

Denote by G[S) the set of functions assigned to vertices of the circuit S, 

Theorem 1.

Let $P$ be an arbitrary nonelementary basis of faults and $S_n$ be some circuit
in an arbitrary basis which has $n$ inputs and at least two gates.

a) If the set PuG[Sn) contains only monotone Boolean functions and n > 4^ 
then for any fc > 2 

< h^Sn) < 

b) If the basis P contains only monotone self-dual Boolean functions^ the set 
G[Sn) contains at least one nonmonotone function^ and n > 4^ then for any 
fc > 2, 

< hps^) < Pi 

c) If the set P U G[Sn) contains a nonmonotone Boolean function and P 
contains a function which is not monotone self- dual function then for n > 1 and 
fc > 2, 

PLi < hpSn) < Pi 

Abstract. In this paper we present an exemplary algorithm that classifies
new objects by matching them directly against the data table to generate
a relevant decision, instead of matching them against all rules generated from
the data table (see [1]). We report results of experiments on three medical
data sets, concerning lymphography, breast cancer and primary tumor
(see [8]). We compare standard methods for extracting laws from decision
tables (see e.g. [17], [1]), based on rough sets (see [13]) and Boolean reasoning
(see [2]), with the method based on algorithms calculating relevant
decision rules for new objects. We also compare the results of computer
experiments on those data sets obtained by applying our system based
on rough set methods with the results on the same data sets obtained
with the help of several data analysis systems known from the literature.



1 Introduction 

A classification algorithm is an algorithm which permits us to repeatedly make 
a forecast on the basis of accumulated knowledge in new situations (see e.g. [9]). 
We consider here a classification related to construction of a classifying algorithm 
which on the basis of current knowledge will be applied to a number of cases to 
classify objects previously unseen; each new object will be assigned to a class 
belonging to a predefined set of classes on the basis of observed values of suitably 
chosen attributes (features). 

Many approaches have been proposed for constructing classification algo- 
rithms, among them we would like to point out classical and modern statistical 
techniques (see e.g. [9]), neural networks (see e.g. [9]), decision trees (see e.g. [3], 
[15], [9]), decision rules (see e.g. [4], [8], [6] [18], [12]) inductive logic programming 
(see e.g. [5]). 

The most popular method of constructing classification algorithms is based
on learning rules from examples. One can use rough set methods to discover
rules from data sets (see e.g. [17], [1]). The methods based on calculation of all
reducts make it possible to compute, for given data, the descriptions of concepts by means
of decision rules (see [13]). Unfortunately, the problem of searching for a reduct of
minimal length (minimal number of attributes) is NP-hard (see [16]). Therefore
we often apply approximation algorithms to obtain some knowledge about the
reduct set (see [11], [19]). Another approach can be based on the construction of
algorithms that do not require calculation of the decision rule set before new objects
are classified. These algorithms classify any new object by generating the relevant
rules for this object only. The aim of the paper is to present an example of
such an algorithm. We compare the results of a classification algorithm based on
standard rough set methods with methods based on the presented algorithm, and
we also compare rough set methods with classification algorithms known from
the literature.

The experiments show that algorithms generating only the decision rules
relevant to new cases are faster than algorithms based on generation of the whole
decision rule set, especially when they are applied to large decision tables. The results suggest
that methods using rough set techniques can be treated as a promising tool
for extracting laws from experimental data sets and that their performance is fully
comparable with the performance of other classification systems (see also [1] for
a more complete discussion).

2 Classification Algorithms based on Decision Rules 

We assume that the reader is familiar with basic notions of rough set theory
(see [13]) and methods of decision rule generation based on Boolean reasoning
(see [2]). In particular, by $\mathbf{A} = (U, A \cup \{d\})$ we denote the decision table and by
$RUL(\mathbf{A})$ we denote the set of all optimal basic decision rules of $\mathbf{A}$ (i.e. decision
rules with a minimal number of descriptors in the predecessor and only one decision
descriptor in the successor, see [13], [1]). The cardinality of the image $d(U) = \{k :
d(s) = k \text{ for some } s \in U\}$ is called the rank of $d$ and is denoted by $r(d)$. We
assume that the set of values of the decision $d$ is equal to $\{v_d^1, \ldots, v_d^{r(d)}\}$. The
decision $d$ determines the partition $CLASS_{\mathbf{A}}(d) = \{X_{\mathbf{A}}^1, \ldots, X_{\mathbf{A}}^{r(d)}\}$ of the universe
$U$, where $X_{\mathbf{A}}^k = \{x \in U : d(x) = v_d^k\}$ for $1 \le k \le r(d)$. The set $X_{\mathbf{A}}^i$ is called the
$i$-th decision class of $\mathbf{A}$. By $X_{\mathbf{A}}(u)$ we denote the decision class $\{x \in U : d(x) =
d(u)\}$, for any $u \in U$. If $r \in RUL(\mathbf{A})$, then by $Pred(r)$, $Succ(r)$ we denote the
predecessor of $r$ and the successor of $r$, respectively, and by $d(r)$ we denote the
decision value specified by $Succ(r)$. An object $u \in U$ is matched by a decision
rule $r \in RUL(\mathbf{A})$ iff $u$ belongs to the set describing the meaning (see e.g. [13])
of $Pred(r)$. Then we will say that the rule classifies $u$ to the decision class
$d(r)$. The set of all objects matched by a decision rule $r$ is denoted by $Match_{\mathbf{A}}(r)$.
An object $u \in U$ supports a decision rule $r \in RUL(\mathbf{A})$ iff $u$ belongs to the
meaning of $Pred(r)$ and $u$ belongs to the meaning of $Succ(r)$. The set of all
objects supporting a decision rule $r$ is denoted by $Supp_{\mathbf{A}}(r)$. If $r$ is a decision
rule in $\mathbf{A}$, then the number $\mu_{\mathbf{A}}(r) = \frac{card(Supp_{\mathbf{A}}(r))}{card(Match_{\mathbf{A}}(r))}$ is called the coefficient of
consistency of the rule $r$. If $\mu_{\mathbf{A}}(r) = 1$ then we will say that the decision rule $r$
is consistent in $\mathbf{A}$; otherwise the decision rule $r$ is inconsistent or approximate
in $\mathbf{A}$.

Let $\mathbf{W} = (W, A \cup \{d\})$ be a hypothetical universal decision table (including
known and unknown objects describing the considered aspect of reality,
see [1]) and let $\mathbf{A} = (U, A \cup \{d\})$ denote a given subtable of the universal
decision table. Let $u_t \in W$ be a so-called tested object. Our task consists in
assigning the value $d(u_t)$ of the decision $d$ to the tested object $u_t$, based only on
the values of the condition attributes of $u_t$ and relying on the given decision table $\mathbf{A}$. A
solution of this problem is a classification algorithm sufficiently approximating
the decision function $d$. There are many methods for decision rules generation
from data (see e.g. [6], [7], [8], [17], [1]).

When a set of decision rules has been computed then it is necessary to use 
some methods to resolve conflicts between rule sets classifying tested objects 
to different decision classes. Therefore we use some measures of the strength of 
rule set matched by a given tested object and classifying this object to decision 
classes pointed out by the rules from this set. In this paper we consider the global 
strength of rule set presented in [1] and defined by 



$$GlobalStrength(X_i, u_t) = \frac{\mathrm{card}\left(\bigcup_{r \in MRul(X_i, u_t)} Match_{\mathbf{A}}(r)\right)}{\mathrm{card}(X_i)}$$

where $u_t \in W$ is a tested object, $MRul(X_i, u_t) \subseteq RUL(\mathbf{A})$ is the set of all calculated
basic decision rules for $\mathbf{A}$ that classify objects to the decision class $X_i$ and
match the tested object $u_t$. The global strength of a decision rule set is similar to
the strength of a rule presented in [8] and [6].

Sometimes the decision rules generated by applying rough set methods can- 
not be accepted as laws valid for data encoded in a given decision table. This 
happens, e.g. when the number of examples supporting the decision rule is rela- 
tively small. Therefore one can use approximate decision rules instead of consis- 
tent decision rules to construct the classification algorithm for a decision table 
A. Different methods (see [1], [10], [14], [20]) are now widely used to generate ap- 
proximate decision rules. In our experiments (see Section 4) we used the method 
of approximate rules generation described in [1]. 



3 Decision rule synthesis by matching new objects 
against data tables 

The method of classification algorithms construction mentioned in the previous 
section is based on methods for decision rules generation from decision tables. 
We have applied (see e.g. [1]) an algorithm for reduct set computation to obtain 
decision rule set. However, the time cost of the reduct set computation can be
too high when the decision table has too many attributes, attribute values,
or objects. Therefore we often apply approximate algorithms
to extract some knowledge about the reduct set (see [11], [19]). Another approach
can be based on the construction of algorithms that do not require calculation of
the decision rule set before new objects are classified. These algorithms classify
any new object by generating relevant rules for this object only. In this paper
we present an example of such an algorithm.

Let $\mathbf{W} = (W, A \cup \{d\})$ be a universal decision table and let $\mathbf{A} = (U, A \cup \{d\})$
be a given decision table ($U \subseteq W$). By $EQL_{\mathbf{A}}(u_1, u_2)$ we denote the set
$\{a \in A : a(u_1) = a(u_2)\}$ for any $u_1, u_2 \in W$.

An object $u_r \in U$ classifies a tested object $u_t \in W$ to the decision class
$X_{\mathbf{A}}(u_r)$ iff $\mathrm{card}(EQL_{\mathbf{A}}(u_r, u_t)) > 0$.

An object $u_r \in U$ exactly classifies a tested object $u_t \in W$ to the decision
class $X_{\mathbf{A}}(u_r)$ iff $u_r$ classifies $u_t$ to the decision class $X_{\mathbf{A}}(u_r)$ and for any $u \in U$:

$EQL_{\mathbf{A}}(u_r, u_t) \subseteq EQL_{\mathbf{A}}(u, u_r) \Rightarrow d(u) = d(u_r)$.

An object $u_r \in U$ does not classify a tested object $u_t \in W$ exactly to the decision
class $X_{\mathbf{A}}(u_r)$ iff $u_r$ classifies $u_t$ to the decision class $X_{\mathbf{A}}(u_r)$ and two conditions
are satisfied: $EQL_{\mathbf{A}}(u_r, u_t) \subseteq EQL_{\mathbf{A}}(u', u_r)$ and $d(u') \ne d(u_r)$ for some object
$u' \in U$.

Let $u_r \in U$ and $u_t \in W$. By $\alpha_{\mathbf{A}}(u_r, u_t)$ we denote the number

$$\alpha_{\mathbf{A}}(u_r, u_t) = \frac{\mathrm{card}(REC_{\mathbf{A}}(u_r, u_t) \cap X_{\mathbf{A}}(u_r))}{\mathrm{card}(REC_{\mathbf{A}}(u_r, u_t))}$$

called the coefficient of classification of $u_t$ by the object $u_r$, where $REC_{\mathbf{A}}(u_r, u_t)$ is
the set

$\{u \in U : \mathrm{card}(EQL_{\mathbf{A}}(u, u_t)) > 0 \wedge EQL_{\mathbf{A}}(u_r, u_t) \subseteq EQL_{\mathbf{A}}(u, u_r)\}$.

It is easy to see that the object $u_r$ exactly classifies an object $u_t$ iff $\alpha_{\mathbf{A}}(u_r, u_t) = 1$.

If $u_t \in W$, $v \in V_d$ and $\alpha \in [0, 1]$ then by $REC^{v}_{\mathbf{A}}(\alpha, u_t)$ we denote the set
$\{u \in U : d(u) = v \wedge \mathrm{card}(EQL_{\mathbf{A}}(u, u_t)) > 0 \wedge \alpha_{\mathbf{A}}(u, u_t) \ge \alpha\}$.

Now we present an example of a classification algorithm. It is based on the
set $REC^{v}_{\mathbf{A}}(\alpha, u_t)$ for any tested object $u_t$, the fixed threshold $\alpha$ and decision
values $v \in \{v_1, \ldots, v_{r(d)}\}$.
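The definitions of this section can be illustrated with a minimal Python sketch. It assumes that objects are stored as tuples of condition attribute values and that decisions are kept in a parallel list; the function names `eql` and `classification_coefficient` are hypothetical and are not taken from RSES-lib.

```python
def eql(x, y):
    """EQL(x, y): indices of condition attributes on which x and y agree."""
    return {k for k, (a, b) in enumerate(zip(x, y)) if a == b}


def classification_coefficient(objects, decisions, r, u_t):
    """Coefficient of classification of u_t by the object with index r.

    Returns None when that object does not classify u_t at all,
    i.e. when the two objects share no attribute value.
    """
    shared = eql(objects[r], u_t)
    if not shared:
        return None
    # REC: objects sharing something with u_t and agreeing with the
    # classifying object on every attribute in `shared`.
    rec = [j for j, u in enumerate(objects)
           if eql(u, u_t) and shared <= eql(u, objects[r])]
    same_class = [j for j in rec if decisions[j] == decisions[r]]
    return len(same_class) / len(rec)  # rec always contains r itself


# Toy decision table (an assumption made only for illustration).
objects = [(1, 0), (1, 1), (0, 1), (0, 0)]
decisions = ["A", "A", "B", "B"]
print(classification_coefficient(objects, decisions, 0, (1, 1)))
print(classification_coefficient(objects, decisions, 2, (1, 1)))
```

In this toy table the first object classifies the tested object exactly (coefficient 1), while the third one does not, because an object of the other class agrees with it on the shared attribute.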

Algorithm A2. Classification by generation of relevant decision rules

Input:

1. $\mathbf{A} = (U, A \cup \{d\})$ is a given decision table, where $U = \{u_1, \ldots, u_n\}$ and
   $A = \{a_1, \ldots, a_m\}$; $\mathbf{A}$ is a subtable of a universal decision table
   $\mathbf{W} = (W, A \cup \{d\})$ (i.e. $U \subseteq W$),

2. a tested object $u_t \in W$,

3. a classification threshold $\alpha_0$ (e.g. $\alpha_0 = 0.9$).

Output: decision value for object $u_t$.

Data structure:

OA : array [1..n] of boolean values (0 or 1),

OL : integer list of object numbers (maximum size: n),

AL : integer list of attribute numbers (maximum size: m),

DVQ : integer array [1..card($V_d$)] of decision value weights.

Method:

For i = 1 to n do OA[i] = 0

For i = 1 to n do
  If OA[i] = 0 then
  begin
    AL = $EQL_{\mathbf{A}}(u_i, u_t)$
    If card(AL) > 0 then
    begin
      OL = $\emptyset$ and count = 0
      For j = 1 to n do
      begin
        If AL $\subseteq EQL_{\mathbf{A}}(u_i, u_j)$ then
        begin
          count = count + 1
          If $d(u_i) = d(u_j)$ then OL = OL $\cup \{u_j\}$
        end
      end
      If card(OL)/count $\ge \alpha_0$ then
        for any $u_l \in$ OL do OA[l] = 1
    end
  end

For i = 1 to card($V_d$) calculate DVQ[i]

Extract MDVQ = $\{i : DVQ[i] = \max_j DVQ[j]\}$

Randomly choose decision value $v_s$ from MDVQ.

Return $v_s$ □

It is easy to see that the time and space complexity of Algorithm A2 are of
order $O(n^2 \cdot m)$ and $O(n + m + \mathrm{card}(V_d))$, respectively.

One can treat our method as a method of generating only those decision rules
(with coefficient of classification not lower than the fixed threshold) which
can be involved in the classification of a given tested object $u_t$. It is not necessary
to compute all decision rules and then to match $u_t$ against all of them.

One can show that the results of the above algorithm are equivalent, in a sense,
to the results of the algorithm based on calculating all decision rules with
consistency coefficient not lower than the fixed threshold and using the global strength of
rule set (see Section 2) as the strategy for conflict resolution.

Let us observe that the presented method generates only rules relevant to
tested objects. This can save computation time because it is not necessary to
match a new object against all decision rules generated for the decision table.
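The control flow of Algorithm A2 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the RSES-lib implementation: objects are tuples of attribute values, the function name `classify_a2` is hypothetical, and the weight of a decision value is assumed to be the fraction of its decision class marked in OA (in the spirit of the global strength of Section 2), because the weight formula is not legible in the printed algorithm.

```python
import random
from collections import defaultdict


def eql(x, y):
    """Indices of condition attributes on which objects x and y agree."""
    return {k for k, (a, b) in enumerate(zip(x, y)) if a == b}


def classify_a2(objects, decisions, u_t, alpha0=0.9, rng=random):
    """Classify u_t by generating only the rules relevant to it."""
    n = len(objects)
    marked = [False] * n                      # the array OA
    for i in range(n):
        if marked[i]:
            continue
        al = eql(objects[i], u_t)             # the list AL
        if not al:
            continue
        ol, count = [], 0                     # the list OL and the counter
        for j in range(n):
            if al <= eql(objects[i], objects[j]):
                count += 1
                if decisions[j] == decisions[i]:
                    ol.append(j)
        if len(ol) / count >= alpha0:         # count >= 1 because j == i matches
            for l in ol:
                marked[l] = True
    # ASSUMPTION: weight of a decision value = marked objects of its class
    # divided by the class size; the source does not show this formula.
    class_size, covered = defaultdict(int), defaultdict(int)
    for l in range(n):
        class_size[decisions[l]] += 1
        if marked[l]:
            covered[decisions[l]] += 1
    if not covered:
        return None                           # no relevant rule was generated
    weights = {v: covered[v] / class_size[v] for v in covered}
    best = max(weights.values())
    return rng.choice(sorted(v for v, w in weights.items() if w == best))


objects = [(1, 0), (1, 1), (0, 1), (0, 0)]
decisions = ["A", "A", "B", "B"]
print(classify_a2(objects, decisions, (1, 1)))
```

The two nested loops over objects, each comparing up to m attributes, correspond to the stated time complexity of order n^2 · m.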

4 Experiments with Data 

We present the results of experiments performed on the following three medical 
data sets: lymphography, breast cancer and primary tumor (see [8]). The medical 
Data           Algorithm  Coefficient               Number     Time              Error rate
                          ($\mu_0$ or $\alpha_0$)   of rules   Train    Test     Train    Test

Lymphography   A1         0.9                       175        57.32    0.05     0.085    0.193
               A2         0.9                       -          54.18    0.38     0.095    0.200
Breast cancer  A1         0.75                      647        50.33    0.58     0.261    0.271
               A2         0.75                      -          10.80    1.16     0.261    0.271
Primary tumor  A1         0.75                      6599       625.78   8.42     0.366    0.679
               A2         0.75                      -          35.62    3.77     0.392    0.697

Table 1. Comparison of selected methods with rules and without rules



data used in our experiments were obtained from the University Medical Centre, 
Institute of Oncology, Ljubljana, Yugoslavia. Thanks go to M. Zwitter and M. 
Soklic for providing the data (see [8]). 

The lymphography data consists of 18 condition attributes (nominal at- 
tributes with small numbers of values), one decision attribute (4 decision classes) 
and 148 objects. 

The breast cancer data consists of 9 condition attributes (nominal at- 
tributes with small numbers of values), one decision attribute (2 decision classes) 
and 286 objects. 

The primary tumor data consists of 17 condition attributes (nominal
attributes with small numbers of values), one decision attribute (22 decision
classes) and 339 objects.

We have applied the cross-validation method of estimating error rate (see 
e.g. [9]). Ten-fold cross validation was performed by using tested classification 
algorithms. 
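The cross-validation estimate of the error rate can be sketched as follows. The paper does not say how the folds were formed, so the random shuffling, the fixed seed and the averaging of per-fold error rates are assumptions; `cross_validation_error` and the `classify(train_objects, train_decisions, tested_object)` signature are hypothetical names used only for this sketch.

```python
import random


def cross_validation_error(objects, decisions, classify, folds=10, seed=0):
    """Average test error over `folds` disjoint parts.

    Assumes len(objects) >= folds, so that no part is empty.
    """
    order = list(range(len(objects)))
    random.Random(seed).shuffle(order)
    parts = [order[k::folds] for k in range(folds)]
    rates = []
    for part in parts:
        held_out = set(part)
        train = [i for i in order if i not in held_out]
        train_x = [objects[i] for i in train]
        train_y = [decisions[i] for i in train]
        wrong = sum(classify(train_x, train_y, objects[i]) != decisions[i]
                    for i in part)
        rates.append(wrong / len(part))
    return sum(rates) / len(rates)


def always_a(train_x, train_y, tested):
    """Trivial stand-in classifier used only to exercise the estimator."""
    return "A"


data = [(i,) for i in range(20)]
print(cross_validation_error(data, ["A"] * 20, always_a))
```

Any classifier with the same signature, for instance a wrapper around Algorithm A2, can be passed in place of the stand-in.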

The algorithm presented in this paper is implemented in the object-oriented
programming library "RSES-lib", which forms the computational kernel of the system
"ROSETTA" (see [12] for more details). The computers used for experiments
were HP Workstations (series 9000, model 712/60MHz, 12.9 MFlops, 79 MIPS).

We report results of experiments performed with the classification algorithm
that predicts the decision for new objects by computing only the decision
rules relevant for classifying them (algorithm A2, see Section 3). We
compare these results with those obtained by algorithm A1, which calculates
all decision rules with consistency coefficient not lower than the fixed
threshold $\mu_0$ and uses the global strength of rule set as the conflict
resolution strategy (see Section 2). Table 1
shows the results of the considered classification algorithms for the medical data
sets. In the case of the cross-validation method we present the average (over all
folds) values of error rate, time of computation and number of rules.

Table 2 shows error rates for other classification systems obtained using the
medical data sets. The learning systems compared in this paper fall into two categories:

1. methods based on Decision Trees: Assistant Professional (see [3]), C4.5 (see [15]).

2. methods based on Decision Rules: AQ15 (see [8]), CN2 (see [4]), Naive LERS
   and New LERS (see [6]).
System                  Error rate
                        Lymphography   Breast cancer   Primary tumor

Assistant Professional  0.240          0.220           0.560
C4.5                    0.230          -               0.600
AQ15                    0.180-0.200    0.320-0.340     0.590-0.710
CN2                     0.180          0.320-0.340     0.550
Naive LERS              0.380          0.490           0.790
New LERS                0.190          0.300           0.670
RSES-lib                0.193          0.271           0.679

Table 2. A comparison of results for medical data



It is easy to see that our results are fully comparable with the results obtained
when using other systems.

5 Summary 

We have presented an example of a classification algorithm that computes decision
rules relevant to particular new objects. The experiments show that the results
of this algorithm are similar to the results of the algorithm using the whole set
of decision rules generated on training cases. It follows from the experiments that
algorithms generating only decision rules relevant to new cases are faster than
the algorithm based on full decision rule generation, especially when they are applied
to large decision tables. The experiments also show that the results of
classification algorithms based on rough set theory are fully comparable with
the results of alternative classification systems known from the literature.

Acknowledgment: This work was supported by the Polish State Committee
for Scientific Research grant No. 8T11C01011 and the Research Program of the
European Union: ESPRIT-CRIT2 No. 20288.
Abstract. In this paper Rough Set Theory (RS) is employed
to discuss “rule+exception” modeling, which will have fewer
rules compared with rule-based modeling and fewer exceptions
compared with example-based modeling. An attribute reduction
strategy based on the discernibility matrix is described. We
attempt to consider what kind of data sets are suitable for the
model, and how to distinguish exceptions within the data sets.
To illustrate the principle, the psychological model of Nosofsky’s
category learning is simulated, and three more complex
examples are provided.



1 Introduction 

In Shepard, Hovland, and Jenkins’ psychological study [4], six types of classifi- 
cation problems on category learning were presented (see sec. 5 in this paper). 
Nosofsky, Gluck, and Glauthier [1] conducted more complicated replications and 
extensions of this classic study. Later Nosofsky, Palmeri, and McKinley [2] ex- 
plained the set of learning data by the RULEX model based on the “Rules plus 
Exceptions” model. They claimed that for problems of type 1 and 2, concise rules 
are available; for type 3 , 4 and 5, concise rules and a few additional exceptions 
are needed; for type 6 however, the only way is to store all of the examples in 
the memory. This seems to imply three different types of data sets which cor- 
respond to three different models under the consistent classification constraints: 
(1) can be modeled as concise rules; (2) can be modeled as concise rules plus 
some exceptions; (3) can not be modeled as concise rules and need to store exam- 
ples. We believe it is important that different data sets require different models. 
Especially, if some data sets adopt a “rule+exception” model, they will need 
fewer rules compared with the rule-based model, and will have fewer exceptions 
compared with the example-based model. 

In order to discuss the above three models, the following two questions have
to be addressed first: (1) For a given data set, which model is suitable:
rule-based, example-based, or “Rule+Exception”-based? (2) Given a data set which
is suitable for “Rule+Exception,” how can exceptions be distinguished within the
data set?
2 An Example

Table 1 is the “Car Classification” data set taken from [7], where mileage is the 
dependent variable. 



Table 1. The car example

No.  Size     Cyl  Turbo  Fuelsys  Displace  Comp  Power  Trans   Weight  Mileage

1    comp     6    y      EFI      med       high  high   auto    med     med
2    comp     6    n      EFI      med       med   high   manual  med     med
3    comp     6    n      EFI      med       high  high   manual  med     med
4    comp     4    y      EFI      med       high  high   manual  light   high
5    comp     6    n      EFI      med       med   med    manual  med     med
6    comp     6    n      2-BBL    med       med   med    auto    heavy   low
7    comp     6    n      EFI      med       med   high   manual  heavy   low
8    subcomp  4    n      2-BBL    small     high  low    manual  light   high
9    comp     4    n      2-BBL    small     high  low    manual  med     med
10   comp     4    n      2-BBL    small     high  med    auto    med     med
11   subcomp  4    n      EFI      small     high  low    manual  light   high
12   subcomp  4    n      EFI      med       med   med    manual  med     high
13   comp     4    n      2-BBL    med       med   med    manual  med     med
14   subcomp  4    y      EFI      small     high  high   manual  med     high
15   subcomp  4    n      2-BBL    small     med   low    manual  med     high
16   comp     4    y      EFI      med       med   high   manual  med     med
17   comp     6    n      EFI      med       med   high   auto    med     med
18   comp     4    n      EFI      med       med   high   auto    med     med
19   subcomp  4    n      EFI      small     high  med    manual  med     high
20   comp     4    n      EFI      small     high  med    manual  med     high
21   comp     4    n      2-BBL    small     high  med    manual  med     med


RS reduction [3] is employed to get the rule set, which is shown in the left side
of table 2. If the 20th example in table 1 is considered as an exception, another
set of rules, which has fewer and more concise rules, is found by RS reduction,
as shown in the right side of table 2. (In table 2 an asterisk means that the feature
can be ignored in the corresponding rule.) The “Rule+Exception” model needs two
attributes and four rules, compared to the four attributes and six rules needed by
the rule-only model, at the cost of one exception (the 20th example conflicts with rule 1
in the right side of table 2). The roughness of the rule model is 1, while that of the
rule+exception model is about 0.95.
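The claim about the single exception can be checked mechanically. The sketch below encodes the Size, Weight and Mileage columns of table 1 and the four rules of the “Rule+Exception” model; the helper names are hypothetical, and reading the figure 0.95 as the fraction of examples not conflicting with the rules is an assumption (it gives 20/21).

```python
# (size, weight, mileage) for examples 1..21 of the car table.
CARS = [
    ("comp", "med", "med"), ("comp", "med", "med"), ("comp", "med", "med"),
    ("comp", "light", "high"), ("comp", "med", "med"), ("comp", "heavy", "low"),
    ("comp", "heavy", "low"), ("subcomp", "light", "high"), ("comp", "med", "med"),
    ("comp", "med", "med"), ("subcomp", "light", "high"), ("subcomp", "med", "high"),
    ("comp", "med", "med"), ("subcomp", "med", "high"), ("subcomp", "med", "high"),
    ("comp", "med", "med"), ("comp", "med", "med"), ("comp", "med", "med"),
    ("subcomp", "med", "high"), ("comp", "med", "high"), ("comp", "med", "med"),
]

# (size, weight, mileage); "*" means the feature is ignored by the rule.
RULES = [
    ("comp", "med", "med"),
    ("*", "light", "high"),
    ("subcomp", "*", "high"),
    ("*", "heavy", "low"),
]


def matches(rule, car):
    """True when the condition part of the rule fits the example."""
    return all(r in ("*", c) for r, c in zip(rule[:2], car[:2]))


def exceptions(cars, rules):
    """1-based numbers of examples that conflict with some matching rule."""
    return [no for no, car in enumerate(cars, start=1)
            if any(matches(rule, car) and rule[2] != car[2] for rule in rules)]


conflicting = exceptions(CARS, RULES)
agreement = 1 - len(conflicting) / len(CARS)
print(conflicting, round(agreement, 2))
```

Only example 20 conflicts with the rules, and every other example is matched by at least one rule with the correct mileage.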



3 Principle 

The key idea of the “rule+exception” model is how to select the set of exceptions
from a given data set. We first describe our algorithm and justify it later in
this section.
Table 2. The complete rule set and the reduced rule set.

Rule                                             Rule + Exception
size     fuelsys  displace  weight  mileage      size     weight  mileage

comp     *        med       med     med          comp     med     med
comp     2-BBL    small     *       med
*        *        *         light   high         *        light   high
subcomp  *        *         *       high         subcomp  *       high
*        EFI      small     *       high
*        *        *         heavy   low          *        heavy   low


(1) Reduce a given database W, obtaining a set of rules R. Create a frequency
histogram of the rules.

(2) Select exceptions E according to the criterion described below.

(3) Delete E from W and form a new data set W' = W - E.

(4) Repeat step (1) for database W' and find a new rule set R'.

(5) Use R' to test the given database W. Examples conflicting with R'
constitute an exception set E'. Generally, $E' \subseteq E$.

R' + E' is the “rule+exception” model of the given data set. If E is empty
after step (2), it means the given data set is not suitable for a “rule+exception”
model.

In step (1), the set of rules is a reduction based on RS for the given data 
set. In view of the importance of attribute reduction for obtaining more concise 
rules and a smaller rule set, we present an attribute reduction strategy with 
polynomial time complexity O(n^) based on a discernibility matrix as follows. 

Let M be the discernibility matrix of a decision table D. An element of M,
$T_{ij}$, is called a term. It is the set of all attributes which discern the i-th and j-th
examples in D. $A = \{a_1, a_2, \ldots\}$ is the set of all attributes in D, and $T_{ij} \subseteq A$ [5].

Let $p(a_k)$ be the number of all terms in M containing attribute $a_k$, called the
attribute frequency function of $a_k$. Then $p(a_k)$ is the significance of the attribute
$a_k$. By this definition, the following reduction strategy can be designed:

Let $C_0$ be the set of cores on M, and let $R = C_0$.

(1) Let $Q = \{T_{ij} : T_{ij} \cap R \ne \emptyset, i \ne j, i, j = 1, 2, \ldots, n\}$, $M = M - Q$ and
$B = A - R$;

(2) For all $a_k \in B$, compute $p(a_k)$ in M, and let $p(a_q) = \max_k p(a_k)$;

(3) Let $R = R \cup \{a_q\}$;

(4) Repeat the above steps till $M = \emptyset$;

The set R is an attribute reduction of decision table D.

Although we have proved in theory that this strategy is incomplete for min- 
imal attribute reduction [6], it is very efficient in our experiment. 
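The reduction strategy can be sketched in Python as follows. The sketch rests on several assumptions that the text does not spell out: terms are built only for pairs of examples with different decisions, cores are taken to be the attributes of single-attribute terms, and ties in the frequency function are broken by the lowest attribute index. The function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def discernibility_terms(rows, decisions):
    """Attribute sets discerning each pair of examples with different decisions."""
    terms = []
    for i, j in combinations(range(len(rows)), 2):
        if decisions[i] != decisions[j]:
            term = frozenset(k for k, (a, b) in enumerate(zip(rows[i], rows[j]))
                             if a != b)
            if term:                      # skip indiscernible (inconsistent) pairs
                terms.append(term)
    return terms


def frequency_reduct(rows, decisions):
    """Greedy attribute reduction driven by the attribute frequency function."""
    terms = discernibility_terms(rows, decisions)
    reduct = {next(iter(t)) for t in terms if len(t) == 1}   # the cores
    while True:
        # Step (1): drop every term already discerned by the current reduct.
        terms = [t for t in terms if not (t & reduct)]
        if not terms:                     # Step (4): the matrix is empty
            return reduct
        # Steps (2)-(3): add the most frequent remaining attribute.
        freq = Counter(a for t in terms for a in t)
        reduct.add(max(sorted(freq), key=freq.get))


# Two Boolean examples over attributes (a, b, c): the class depends on a
# in the first table and on whether a equals b in the second.
rows = [(0, 0, 0), (0, 1, 0), (0, 1, 1), (0, 0, 1),
        (1, 0, 0), (1, 1, 0), (1, 0, 1), (1, 1, 1)]
by_a = ["A", "A", "A", "A", "B", "B", "B", "B"]
by_a_eq_b = ["A" if r[0] == r[1] else "B" for r in rows]
print(frequency_reduct(rows, by_a), frequency_reduct(rows, by_a_eq_b))
```

Each pass removes at least the terms containing the newly added attribute, so the loop runs at most once per attribute.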



4 The suitability of the “Rule+Exception” model

Given a consistent data set W, a set of rules R can be obtained by reduction
under the consistent decision constraint. If this data set can be modeled as
“Rule+Exception,” then by applying the above principle, this data set can be
divided into a set of exceptions E and the remaining data set W'. Reducing W'
again, another set of rules R' can be acquired. Generally, R' cannot ensure the
correct classification of the exceptions E. Therefore, to make the above procedure
nontrivial, R' should be more concise than R and the set E should be small.
This is the key idea that decides whether or not a data set can be modeled as
“Rule+Exception.”

Let A and A' be the condition attribute sets of the rules R and R' produced
by reducing W and W', respectively. E is a set of exceptions. Let CARD(.) be
the function that computes the number of elements of a given set. Function
f(.) is a function for computing the frequency of a rule. For a given data set to
be modeled as “Rule+Exception,” the following conditions should be satisfied:



max{/(R^) : Rr e R} ^ min{f{Rr) : Rr E R} > 0 (1) 

$CARD(A) - CARD(A') > \alpha$ (2)

$CARD(R) - CARD(R') > \gamma$ (3)

$CARD(E) < CARD(R) - CARD(R') + \beta$ (4)



In the above conditions, (1) is a necessary condition for a given data set to be
modeled as “Rule+Exception,” where $\theta$ is an integer larger than 1 (generally $\theta$
is a very large positive integer). It means that, if data set W can be divided into
rules and exceptions, then there must exist at least one rule whose frequency is
lower than that of the other rules. This condition can also be written as:

$E = \{R_r : f(R_r) < \theta, R_r \in R\}$ (5)

That is, if the frequency of a rule is below $\theta$, then the examples corresponding
to it can be regarded as exceptions. In this paper we use

$E = \{R_r : f(R_r) = 1, R_r \in R\}$ (6)

as the exception criterion.

If W satisfies conditions (2) and (3), then R' is relatively concise. If condition
(4) is satisfied, then there will be few exceptions. Here $\alpha$, $\gamma$, and $\beta$ are constants
related to the given database.

If all the above conditions are satisfied by W, we say W is suitable to be modeled
as “Rule+Exception.” That is, the rule+exception model is more effective
for W than the rule-based or the example-based models.

It is interesting that the above principle also implies the conditions for a data
set to be modeled by a rule-based or an example-based model. If (1) is not satisfied,
and

$\max\{f(R_r) : R_r \in R\} = \min\{f(R_r) : R_r \in R\}$ (7)

and (2) and (3) are satisfied, then a rule-based model is recommended for this
data set. If none of conditions (1), (2) or (3) are satisfied, and CARD(A') =
CARD(A) or CARD(R') = CARD(R), an example-based model is recommended
for the data set.
Table 3. Shepard’s six problems

     P1        P2        P3        P4        P5        P6
     a b c d   a b c d   a b c d   a b c d   a b c d   a b c d

1    0 0 0 A   0 0 0 A   0 0 0 A   0 0 0 A   0 0 0 A   0 0 0 A
2    0 1 0 A   0 0 1 A   0 0 1 A   0 0 1 A   0 0 1 A   1 0 1 A
3    0 1 1 A   1 1 1 A   1 0 1 A   0 1 0 A   0 1 0 A   0 1 1 A
4    0 0 1 A   1 1 0 A   0 1 0 A   1 0 0 A   1 1 1 A   1 1 0 A
5    1 0 0 B   1 0 0 B   1 0 0 B   1 1 0 B   1 0 0 B   0 0 1 B
6    1 1 0 B   1 0 1 B   1 1 0 B   1 1 1 B   1 1 0 B   1 0 0 B
7    1 0 1 B   0 1 0 B   1 1 1 B   1 0 1 B   1 0 1 B   0 1 0 B
8    1 1 1 B   0 1 1 B   0 1 1 B   0 1 1 B   0 1 1 B   1 1 1 B


5 Shepard’s six problems 

In this section, the above principle is used to repeat Nosofsky, Palmeri, and
McKinley’s study [2] on Shepard’s six problems, shown in table 3.

(1) Reduce the above data sets $P_i$ and obtain the rule sets $R_i$ (i = 1, 2, ..., 6),
as shown in the left side of table 4.

(2) Create the frequency table of $R_i$ (i = 1, 2, ..., 6), as shown in the right
side of table 4.



Table 4. The rule sets of Shepard’s six problems and its frequency table 



The rule set | 


T 


le frequeney table 




Ri 


R2 


R3 


R4 


R5 


Rq 




Ri 


R2 


R3 


R4 


R5 


Re 




a d 


a b d 


abed 


abed 


abed 


abed 


1 


0 A 


0 0 A 


0 0 * A 


0 0 * A 


0 0 * A 


0 0 0 A 


f(Ty 


4 


2 


2 


2 


1 


1 


2 












1 0 1 A 


f(2j 












1 


3 




1 1 A 


* 0 1 A 


0 * 0 A 


0 * 0 A 


0 1 1 A 


f(3y 




2 


1 


1 


1 


1 


4 






0 * 0 A 


* 0 0 A 


1 1 1 A 


1 1 0 A 


W 






1 


1 


1 


1 


5 


1 B 


1 0 B 


1 * 0 B 


1 1 * B 


1 0 * B 


0 0 1 B 


W 






1 


2 


1 


1 


6 






1 1 * B 




1 * 0 B 


1 0 0 B 


W 






~T 




~T 


~T 


7 




0 1 B 




1 1 B 




0 1 0 B 


T(7y 




~2~ 




~T 




~1~ 


S' 






* 1 1 B 


* 1 1 B 


0 1 1 B 


1 1 1 B 


f(8) 






~T 


1 


~T 


1 



(3) According to conditions (2), (3) and (6), exceptions $E_i$ are selected, and
$P'_i = P_i - E_i$.

Problems 1 and 2 do not satisfy condition (6), therefore they don’t have
any exceptions; for problem 3, $E_3 = \{3, 4, 5, 8\}$, $P'_3 = \{1, 2, 6, 7\}$; for problem
4, $E_4 = \{3, 4, 7, 8\}$, $P'_4 = \{1, 2, 5, 6\}$; for problem 5, $E_5 = \{3, 4, 5, 8\}$,
$P'_5 = \{1, 2, 6, 7\}$. Because problem 6 does not satisfy conditions (1), (2) and (3), it has
no exceptions (or it has no rules).

(4) Reduce the databases $P'_j$ again and obtain rule sets $R'_j$ (j = 3, 4, 5), see
the left of table 5.

(5) Compute the frequency table of the reduced rule sets $R'_j$ (j = 3, 4, 5), see
the right of table 5.



Table 5. The reduced rule set and its frequency table.

The reduced rule set and the frequency of each rule

$R'_3$            $R'_4$            $R'_5$
Rule   a d        Rule   a d        Rule   a d

1      0 A        1      0 A        1      0 A
f(1)   3          f(1)   3          f(1)   3
6      1 B        5      1 B        6      1 B
f(6)   3          f(5)   3          f(6)   3


(6) Produce the final exceptions $E'_j$, which are the examples conflicting with
$R'_j$. For problem 3, $E'_3 = \{3, 8\}$; for problem 4, $E'_4 = \{3, 8\}$; for problem 5,
$E'_5 = \{4, 8\}$.

Since Problems 1 and 2 have $\max\{f(R_r) : R_r \in R\} = \min\{f(R_r) : R_r \in R\}$,
they do not satisfy condition (1), therefore these problems cannot be modeled
as “Rule+Exception.” However, they satisfy conditions (2) and (3), so a
rule-based model is suggested for them. Problems 3, 4 and 5 satisfy all the above
conditions, so they can use the “Rule+Exception” model. Problem 6 does not
satisfy conditions (1), (2) and (3), so an example-based model is suggested.

6 Three more complex databases 

In this section, three databases from the UCI repository are selected to illustrate 
the principle described in this paper. The problems are Voting, which is suit- 
able for a “Rule+Exceptions” model, Moral-Reasoner, suitable for a rule-based 
model, and Soybean, which is suitable for an example-based model. 

The Voting data set is the 1984 questionnaire on the 435 congressmen in the
U.S. House of Representatives about 16 key problems. Using the above method we
can analyze this database as follows:

(1) Through reduction, 44 rules with 9 attributes are produced. The fre- 
quency histogram of the rules is shown in figure 1. 

(2) Delete all the examples (21 in all) that are used only once. Reduce the 
remaining database again, and 6 rules with 4 attributes are obtained. 

(3) Use the 6 rules to test all of the 435 examples. 13 examples are found 
conflicting with the rules. Thus six rules and thirteen exceptions are a solution to 
the Voting problem. Obviously, this database satisfies all the conditions for using 
the “Rule+Exception” model, and is a typical database for it. Its roughness is 
about 0.97. The 6 rules are shown in table 6. It can be interpreted in natural 
language as: 

The main divergences between Republicans and Democrats are: Republicans
agree on the physician-fee-freeze (PFF) and anti-satellite-test-ban (ASTB)
problems, while Democrats disagree or keep silent. On the adoption-of-the-budget-
resolution (ABR) and synfuels-corporation-cutback (SCC) problems, Democrats
agree while Republicans disagree.
Fig. 1. The frequency histogram of the Voting experiment

Fig. 2. The frequency histogram of the Moral-Reasoner experiment



Table 6. The rule set of the Voting data set

Rule  ABR  PFF  ASTB  SCC  PARTY

0     n    y    *     *    republican
1     *    n    *     *    democrat
2     *    y    y     *    republican
3     y    *    n     y    democrat
4     *    ?    *     *    democrat
5     *    y    *     n    republican


If the effect of exceptions is ignored, the above description can be regarded 
as the conclusion of the questionnaire. If the effect of the exceptions has to be 
considered, it is not difficult to analyze their difference from the final rules and 
translate them into natural language. 

If the given data set is more suited for a rule-based model, the method in 
this paper will recommend a rule-based model for the given data set rather than 
a “Rule+Exception” model. Let’s see the next example. 

The Moral-Reasoner database has 202 examples and 23 attributes. After
reduction, only 4 rules with 1 attribute remain. The corresponding histogram is
shown in figure 2.

Like Shepard’s problem 2, this database does not satisfy condition (1): none
of the rules is infrequent, so a rule-based model is recommended.

The Soybean database contains 307 examples and 35 condition attributes.
Through RS reduction, 116 rules with 9 attributes are produced. Since rules of
different classes have quite different histograms, only those rules which belong
to the alternarialeaf-spot and frog-eye-leaf-spot classes satisfy condition (1).
When the “Rule+Exception” principle is applied to the examples of these two
classes, the numbers of attributes and rules become 8 and 111 respectively, with
14 exceptions. Like Shepard’s problem 6, it cannot fully satisfy conditions (2),
(3) and (4), and an example-based model is recommended.
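The counts reported in this section can be checked against conditions (2), (3) and (4) with a few lines of Python. The paper does not give the constants used for these databases, so setting all three thresholds to zero is an assumption, as are the function name `rule_exception_checks` and the use of the final exception counts (13 and 14) for CARD(E).

```python
def rule_exception_checks(n_attrs, n_attrs_reduced, n_rules, n_rules_reduced,
                          n_exceptions, alpha=0, gamma=0, beta=0):
    """Evaluate conditions (2), (3) and (4) for given set sizes."""
    return {
        "(2)": n_attrs - n_attrs_reduced > alpha,
        "(3)": n_rules - n_rules_reduced > gamma,
        "(4)": n_exceptions < n_rules - n_rules_reduced + beta,
    }


# Voting: 44 rules with 9 attributes become 6 rules with 4 attributes,
# with 13 exceptions.
voting = rule_exception_checks(9, 4, 44, 6, 13)

# Soybean: 116 rules with 9 attributes become 111 rules with 8 attributes,
# with 14 exceptions.
soybean = rule_exception_checks(9, 8, 116, 111, 14)

print(voting)
print(soybean)
```

With zero thresholds the Voting data pass all three checks, while the Soybean data fail condition (4): 14 exceptions are not fewer than the 5 rules saved. Larger values of the constants would make conditions (2) and (3) fail for Soybean as well, since only one attribute and five rules are saved.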
7 Conclusions

The “rule+exception” modeling can be applied to machine learning to decrease
the learning cost, but the more important purpose of considering exceptions in
this paper is that a more concise set of rules is frequently more comprehensible.
By removing the exceptions, humans can more readily understand what a huge
database tells us. The “Voting” experiment is just one example of this kind of
application. This is only a direct application of RS reduction; we believe
RS theory can be applied to broader areas.
Abstract. We show that finding optimal discretization of instances of
decision tables with two attributes with real values and binary decisions 
is computationally hard. This is done by abstracting the problem in such 
a way that it regards partitioning points in the plane into regions, subject 
to certain minimality restrictions, and proving them to be NP-hard. We 
also propose a new method to find optimal discretizations. 



1 Introduction 

Discretization of attributes with real values is an important problem in knowl- 
edge discovery and data mining. It is an indispensable tool in data analysis and 
extraction of decision rules from decision tables. A lot of effort has been spent 
to find effective methods for real value attribute discretization (see [1,2,3,10]). 
We show that certain optimization problems motivated by the discretization 
problem are NP-hard. 

To facilitate extracting decision rules from a decision table with real value 
attributes, the decision tables are usually discretized. Among the well known 
discretization methods are those based on the equal width and equal frequency 
intervals, statistical tests [7], entropy [1,2,3], adaptive quantizers and dynamic 
programming. All these methods are heuristics for discretization of data. 

We restrict our attention to the discretization that selects a set of cut points of
attributes, which determine a partition of the real value attributes into intervals.
The set of cuts determines a grid in k-dimensional space with $\prod_{i=1}^{k} n_i$ regions,
where k is the number of attributes and $n_i$ is the number of intervals of the i-th
attribute. It was shown in [9,10] that the problem Optimal Discretization, to
find a consistent discretization of a given decision table with minimal number
of cuts, is NP-hard. In this paper we discuss the computational complexity of
the problem of finding a consistent discretization with the minimum number of
regions. The main result of this paper is showing that the decision problem of
checking if there is a consistent set of cuts such that the grid defined by them
contains at most K regions, for a given K, is NP-complete. We also improve the
* This research was partly supported by Polish State Committee for Scientific Research 
grant No. 8T11C03614, No. 8T11C01011 and Research Program of European Union 
- ESPRIT-CRIT2 No. 20288. 
result of [10] by showing that the problem Optimal Discretization remains NP-
hard even if a given decision table is restricted to two attributes. These results 
justify the search for approximation algorithms yielding as good discretizations 
as possible. 



2 Basic notions 



An information system [11] is a pair $\mathbb{A} = (U, A)$, where $U$ is a non-empty
finite set called the universe and $A$ is a non-empty finite set of attributes, i.e.
$a : U \to V_a$ for $a \in A$, where $V_a$ is called the value set of $a$. Elements of
$U$ are called objects. An information system of the form $\mathbb{A} = (U, A \cup \{d\})$ is
called a decision table, where $d \notin A$ is called the decision attribute and the
attributes from $A$ are called condition attributes. Let $V_d = \{1, \ldots, r(d)\}$. The
decision $d$ determines a partition of $U$ into decision classes $\{C_1, \ldots, C_{r(d)}\}$, where
$C_k = \{x \in U : d(x) = k\}$ for $1 \le k \le r(d)$. Any non-empty set $B \subseteq A$ defines a
$B$-information function by $Inf_B(x) = \{(a, a(x)) : a \in B\}$ for $x \in U$. A decision
table $\mathbb{A}$ is called consistent if $Inf_A(x) \ne Inf_A(y)$ for any $x, y$ such that
$d(x) \ne d(y)$.



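Let $\mathbb{A} = (U, A \cup \{d\})$ be a decision table with real value attributes
$A = \{a : a : U \to V_a, V_a \subseteq \mathbb{R}\}$ and $d : U \to \{1, \ldots, r(d)\}$. Any pair $(a, c)$ where
$a \in A$ and $c \in \mathbb{R}$ will be called a cut on $V_a$. For $a \in A$, any set of cuts
$\{(a, c_1^a), (a, c_2^a), \ldots, (a, c_{k_a}^a)\}$ on $V_a$ defines a partition of $V_a$ into sub-intervals

$P_a = \{[c_0^a, c_1^a), [c_1^a, c_2^a), \ldots, [c_{k_a}^a, c_{k_a+1}^a)\}$,

where $l_a = c_0^a < c_1^a < c_2^a < \ldots < c_{k_a}^a < c_{k_a+1}^a = r_a$ and
$V_a = [c_0^a, c_1^a) \cup [c_1^a, c_2^a) \cup \ldots \cup [c_{k_a}^a, c_{k_a+1}^a)$. Therefore, any set
of cuts $P = \bigcup_{a \in A} P_a$ defines a new decision table $\mathbb{A}^P = (U, A^P \cup \{d\})$, where
$A^P = \{a^P : a^P(x) = i$ iff $a(x) \in [c_i^a, c_{i+1}^a)$, for $x \in U$ and $i \in \{0, \ldots, k_a\}\}$. A set of
cuts $P$ is $\mathbb{A}$-consistent if it discerns all pairs of objects with different decisions.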
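The definitions above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names (`discretize`, `discretize_table`, `is_consistent`) and the four-object table are hypothetical, and the left and right domain ends $l_a$, $r_a$ are left implicit. A value is mapped to the index of its half-open interval, and a set of cuts is accepted as consistent when no grid cell holds objects with different decisions.

```python
from bisect import bisect_right

def discretize(value, cuts):
    """Index i of the half-open interval [c_i, c_{i+1}) containing value.

    cuts is the sorted list of cut points c_1 < ... < c_k on one attribute.
    """
    return bisect_right(cuts, value)

def discretize_table(rows, cuts_per_attr):
    """Replace every real value by its interval index, attribute by attribute."""
    return [tuple(discretize(v, cuts) for v, cuts in zip(row, cuts_per_attr))
            for row in rows]

def is_consistent(rows, decisions, cuts_per_attr):
    """True if no two objects with different decisions share a grid cell."""
    seen = {}
    for cell, d in zip(discretize_table(rows, cuts_per_attr), decisions):
        if seen.setdefault(cell, d) != d:
            return False
    return True

# Hypothetical table with two real attributes and a binary decision.
rows = [(0.8, 2.0), (1.4, 1.0), (1.4, 3.0), (2.6, 1.0)]
decisions = [0, 1, 0, 1]

print(discretize_table(rows, [[1.0], [1.5]]))
print(is_consistent(rows, decisions, [[], [1.5]]))   # one cut on the second attribute
print(is_consistent(rows, decisions, [[1.0], []]))   # one cut on the first attribute
```

In this toy table a single cut on the second attribute is already consistent, while a single cut on the first attribute is not.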



3 Optimal splitting in $\mathbb{R}^2$

In this section we consider a consistent decision table $\mathbb{A} = (U, \{a, b\} \cup \{d\})$ with
two real value condition attributes $a$, $b$ and binary decision $d : U \to \{0, 1\}$. Such a
decision table is a representation of the set of points $S = \{(a(u_i), b(u_i)) : u_i \in U\}$
in the plane partitioned into two disjoint categories $S = S_1 \cup S_2$. We can
determine such a partition by assigning black and white colors to the points.
Any cut $(a, c)$ on $a$ (or $(b, c)$ on $b$), where $c \in \mathbb{R}$, can be represented by a vertical
(or horizontal) line. The set of cuts is $\mathbb{A}$-consistent if the set of lines representing
them defines a partition of the plane into regions in such a way that if any two
points are in the same region then they have the same color. Such a set of lines
is said to be consistent.
PROBLEM 2-OS: Optimal Splitting in $\mathbb{R}^2$

Input: A set $S$ of points in the plane partitioned into two disjoint
categories $S_1$, $S_2$ and a natural number $T$.

Question: Is there a consistent set of lines such that the partition of the plane
into regions defined by them consists of at most $T$ regions?

Theorem 1. 2-OS is NP-complete.

Proof. It is clear that 2-OS is in NP. The NP-hardness part of the proof is done
by reducing 3SAT to 2-OS (cf. [5]).

Let $\Phi = C_1 \wedge \ldots \wedge C_k$ be an instance of 3SAT. We construct an instance $I_\Phi$
of 2-OS such that $\Phi$ is satisfiable iff there is a sufficiently small consistent set
of lines for $I_\Phi$. The description of $I_\Phi$ will specify a set of points $S$, which will
be partitioned into two subsets of white and black points. A pair of points with
equal horizontal coordinates is said to be vertical; similarly, a pair of points
with equal vertical coordinates is horizontal. If a configuration of points includes
a pair of horizontal points $p_1$ and $p_2$ of different colors, then any consistent
set of lines will include a vertical line $L$ separating $p_1$ and $p_2$, which will be
in the vertical strip with $p_1$ and $p_2$ on its boundaries. Such a strip is referred
to as a forcing strip, and the line $L$ as forced by points $p_1$ and $p_2$. Horizontal
forcing strips and forced lines are defined similarly. The instance $I_\Phi$ has an
underlying grid-like structure consisting of vertical and horizontal forcing strips.
The rectangular regions inside the structure and consisting of points outside the
strips are referred to as f-rectangles of the grid. In the figures that follow and
illustrate the construction of $I_\Phi$, the forcing strips are depicted simply as lines.
These lines make a grid of rectangles which represent the f-rectangles.

For each propositional variable $p$ occurring in $\Phi$ use one special row and
one special column of f-rectangles. In the f-rectangle that is at the intersection
of the row and column place an instance $R_p$ of the variable configuration, as
depicted in Figure 1 (a). Notice that the variable configuration requires at least
one horizontal and one vertical line to separate the white points from the black
ones. If only one such vertical line occurs in a consistent set of lines, then it
separates either the left or the right white point from the central black one,
which we interpret as an assignment of the value true or false to the propositional
variable.



Fig. 1. a) The variable configuration, b) The clause configuration (with the horizontal strips of $r_1$, $r_2$ and $r_3$).
For each clause $C$ in $\Phi$ use one special row and one special column of
f-rectangles. In the region of the intersection of this row and column place an
instance $R_C$ of the clause configuration, which is depicted in Figure 1 (b). Notice
that $R_C$ requires at least three lines to separate the white from the black points,
and among them at least one vertical. Let $C$ be of the form $C = r_1 \vee r_2 \vee r_3$,
where the variables in the literals are all different. Subdivide the row of
f-rectangles of $C$ into three strips corresponding to the literals. For each $r_i$ create
an instance $R_{C,i}$ of the literal configuration, which consists of one black and one
white point, with distinct vertical and horizontal coordinates. Place $R_{C,i}$ at the
intersection of the horizontal strip of $r_i$ and the column of the variable of $r_i$: if
the variable is $p$, then either in the 'true' part of $R_p$, if $r_i = p$, or in the 'false'
part of $R_p$, if $r_i = \neg p$. The pair of points in $R_{C,i}$ has vertical coordinates
equal to the vertical coordinates of the pair of points in $R_C$ in the strip of $r_i$.
An example of this construction is depicted in Figure 2. Column $x_{p_i}$ and row
$y_{p_i}$ correspond to variable $p_i$, row $y_C$ corresponds to clause $C$.



Fig. 2. Construction of configurations $R_{p_i}$ and $R_C$ for $C = p_1 \vee \neg p_2 \vee \neg p_3$



Let the underlying grid of f-rectangles be minimal to accommodate this
construction. We need to add a number of horizontal rows of f-rectangles, so
that each vertical line will contribute sufficiently more regions than a horizontal
line; here we mean lines different from those defining the grid of f-rectangles. To
be specific, let the number of these rows of f-rectangles be equal to the size of $\Phi$
plus 1 (size is the number of occurrences of symbols).

Suppose conceptually that a consistent set of lines $W$ includes exactly one
vertical and one horizontal line per each $R_p$, and exactly one vertical and two
horizontal lines per each $R_C$; let $L_1$ be the set of all these lines. There is also
the set $L_2$ of lines inside the forcing strips, precisely one line per each strip. We
have $W = L_1 \cup L_2$. Let the number of horizontal lines in $W$ be equal to $l_h$ and
vertical to $l_v$. That many lines create $T = (l_h - 1) \cdot (l_v - 1)$ regions, and this
number is the last component of $I_\Phi$.

Next we show the correctness of the reduction. Suppose first that $\Phi$ is satisfiable,
and let us fix a satisfying assignment of logical values to the variables of $\Phi$. The
consistent set of lines is determined as follows. Place one line into each forcing
strip. For each variable $p$ place one vertical and one horizontal line to separate
points in $R_p$, the vertical line determined by the logical value assigned to $p$. Each
configuration $R_C$ is handled as follows. Let $C$ be of the form $C = r_1 \vee r_2 \vee r_3$.
Since $C$ is satisfied, at least one $R_{C,i}$, say $R_{C,1}$, is separated by the vertical line
that also separates $R_p$, where $p$ is the variable of $r_1$. Place two horizontal lines to
separate the remaining $R_{C,2}$ and $R_{C,3}$. They also separate two pairs of points
in $R_C$. Add one vertical line to complete the separation of the points in $R_C$. All this
means that there is a consistent set of lines which creates $T$ regions.

On the other hand, suppose that there is a consistent set of lines for $I_\Phi$, which
determines at most $T$ regions. The number $T$ was defined in such a way that two
lines must separate each $R_p$ and three lines each $R_C$, in the latter case at least
one of them vertical. Notice that a horizontal line contributes fewer regions than
a vertical one because the grid of forcing strips contains many more rows than
columns. Hence one vertical line and two horizontal lines separate each $R_C$,
because changing horizontal to vertical would increase the number of regions
beyond $T$. It follows that, for each clause $C = r_1 \vee r_2 \vee r_3$, at least one $R_{C,i}$ is
separated by a vertical line which also separates $R_p$, where $p$ is the variable
of $r_i$, and this yields a satisfying truth assignment.

4 Optimal discretization in $\mathbb{R}^2$

In this section we consider the problem of finding a consistent partition of the
plane using a minimal number of cuts. We analyze the following decision problem:

PROBLEM k-D2: Discretization in $\mathbb{R}^2$ by at most $k$ cuts.

Input: A set $S$ of points $P_1, \ldots, P_n$ in the plane, partitioned into two disjoint
categories $S_1$, $S_2$, and a natural number $k$.

Question: Is there a consistent set of at most $k$ lines?

We show that the problem Set Cover (see [5]) is reducible to k-D2.

Theorem 2. k-D2 is NP-complete.

Proof. An instance $I'$ of Set Cover consists of $S = \{u_1, u_2, \ldots, u_n\}$,
$\mathcal{F} = \{S_1, S_2, \ldots, S_m\}$, where $S_j \subseteq S$, and an integer $M$, and the
question is if there are $M$ sets from $\mathcal{F}$ whose union contains all elements of
$S$. We need to construct an instance $I$ of k-D2 such that $I$ has a positive answer
iff $I'$ has a positive answer. The construction of $I$ is quite similar to the
construction described in the previous section. We start by building a grid-like
structure consisting of vertical and horizontal forcing strips. The regions are
in rows labeled by $y_{u_1}, \ldots, y_{u_n}$ and columns labeled by $x_{S_1}, \ldots, x_{S_m}, x_{u_1}, \ldots, x_{u_n}$
(see Figure 3). The region $(x_{S_j}, y_{u_i})$ represents $u_i \in S_j$ and the region $(x_{u_i}, y_{u_i})$
represents the sets which include $u_i$. First, for any element $u_i \in S$, define a family
$\mathcal{F}_{u_i} = \{S_{i_1}, S_{i_2}, \ldots, S_{i_{m_i}}\}$ of all subsets containing $u_i$. Then subdivide the row
$y_{u_i}$ into $m_i$ strips, corresponding to the subsets from $\mathcal{F}_{u_i}$. In the strip labeled by
$u_i \in S_j$ place one pair of black and white points inside the region $(x_{u_i}, y_{u_i})$ and
another pair in the column labeled by $x_{S_j}$ (see Figure 3). The points in adjacent
strips in the region $(x_{u_i}, y_{u_i})$ have the same color. In each region $(x_{u_i}, y_{u_i})$
add a special point in the top left corner with a color different from the color
of the point in the top right corner. This point is introduced to force at least
one vertical line across a region. Place the configuration $R_{u_i}$ for $u_i$ in the region
labeled by $(x_{u_i}, y_{u_i})$. Examples of $R_{u_1}$ and $R_{u_2}$, where $\mathcal{F}_{u_1} = \{S_1, S_2, S_4, S_5\}$
and $\mathcal{F}_{u_2} = \{S_1, S_3, S_4\}$, are depicted in Figure 3.




Fig. 3. Construction of configurations $R_{u_1}$ and $R_{u_2}$, where $\mathcal{F}_{u_1} = \{S_1, S_2, S_4, S_5\}$
and $\mathcal{F}_{u_2} = \{S_1, S_3, S_4\}$



The configuration $R_{u_i}$ requires at least $m_i$ lines to be separated, among them
at least one vertical. Thus, the whole construction for $u_i$ requires at least $m_i + 1$
lines. Let $I$ be an instance of k-D2 defined by the set of all points forcing the
grid and all configurations $R_{u_i}$, with $k = M + \sum_{i=1}^{n} m_i + (2n + m + 2)$ as the
parameter, where the last component $(2n + m + 2)$ is the number of lines defining
the grid. If there is a covering of $S$ by $M$ subsets $S_{j_1}, \ldots, S_{j_M}$, then we can
construct $k$ lines that separate the set of points, namely $(2n + m + 2)$ grid
lines, $M$ vertical lines in the columns corresponding to $S_{j_1}, \ldots, S_{j_M}$, and $m_i$
lines per element $u_i$, for $1 \le i \le n$.

On the other hand, let us assume that there are $k$ lines separating the points
from instance $I$. We show that there exists a covering of $S$ by $M$ subsets. There
is a set of lines such that, for any $i \in \{1, \ldots, n\}$, there are exactly $m_i$ lines passing
across the configuration $R_{u_i}$, which is the region labeled by $(x_{u_i}, y_{u_i})$, among
them exactly one vertical line. Hence, there are at most $M$ vertical lines in the columns
labeled by $x_{S_1}, \ldots, x_{S_m}$. These lines determine $M$ subsets that cover the whole
set $S$.



5 Methods to learn optimal partitions 



In the previous section we have shown NP-hardness of discretization problems
for the case of a decision table with two attributes. This implies the hardness of
discretization problems with more attributes. In this section we present a heuristic
for semi-optimal partition generation for a space of arbitrary dimension. An
efficient algorithm for generating a minimal consistent set of cuts was presented
in [8], which takes $O(kn \log n)$ time to extract one cut, where $n$ is the number
of objects and $k$ is the number of attributes. Below we present an algorithm
finding a consistent set of cuts which minimizes the number of regions. To simplify
the exposition, we present the algorithm for the case of two attributes. The
algorithm can be generalized to handle more attributes.

Let $\mathbb{A} = (U, \{a, b\} \cup \{d\})$ be a consistent decision table with two real
attributes $a$ and $b$, where $Card(U) = n$. We assume that the decision $d$ classifies
the universe $U$ into $m$ decision classes. Such a decision table can be represented
as a set of points $S = \{(a(u_i), b(u_i)) : u_i \in U\}$ in the plane painted by
$m$ colors. The task is to find a set of vertical and horizontal lines that divides
the plane into a minimum number of rectangular regions, such that every one
contains points of the same color.

Let $L$ be the set of all possible horizontal and vertical lines for the set of
points $S$. The main idea of the algorithm is to reduce $L$ by removing from it
many useless lines without loss of consistency. The usefulness of a partition line $l$ is
characterized by the number of regions defined by $l$ and the number of point pairs
discerned by $l$.

The line $l$ is useless if both numbers are "small". A given region is adjacent
to line $l$ if $l$ is one of its boundaries. For every line $l$ we introduce a function
$Density(l)$, the density of the regions adjacent to $l$. Let $R_{left}$ and $R_{right}$ be the
sets of points belonging to regions adjacent to the line $l$ on the left and right,
respectively. Let $N(l)$ be the number of regions adjacent to line $l$. The density

function is defined as follows: $Density(l) = \frac{Card(R_{left} \cup R_{right})}{N(l)}$.

For every partition line $l$, define two functions measuring its global and local
discernibility degree as follows:

$GlobalDisc(l) = Card\{(u, v) : d(u) \ne d(v) \wedge u \in R_{left} \wedge v \in R_{right}\}$

$LocalDisc(l) = Card\{(u, v) : d(u) \ne d(v) \wedge u, v \text{ are discerned by } l \text{ only}\}$.



A line $l$ can be rejected if $LocalDisc(l) = 0$. A line $l$ is a preferable candidate
to be rejected if both $Density(l)$ and $GlobalDisc(l)$ are "small". The algorithm
is in four steps:

Step 1. Start with the set $L$ of all possible lines.

Step 2. Find a partition line $l$ with $LocalDisc(l) = 0$ of minimum value
of $Density(l)$. If there are several such lines then select the one with the
minimum value of $GlobalDisc(l)$.

Step 3. Set $L$ to $L - \{l\}$. Update the structure of the regions.

Step 4. If the set of lines $L$ is reducible then go to Step 2, otherwise stop.

The algorithm can be implemented to run in time $O(n)$ per one line reduction.
Therefore a reduction of $R$ lines can be done in time $O(|R| \cdot n)$. The time
performance of the algorithm for a $k$-dimensional space will be $O(|R| \cdot kn)$.
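The reduction loop can be sketched in Python under simplifying assumptions. This is a naive illustration, not the $O(n)$-per-line implementation claimed above. Candidate lines are taken as midpoints between consecutive coordinate values. A line counts as removable when the remaining lines are still consistent, which is the case $LocalDisc(l) = 0$. The $Density$ criterion is replaced by a simpler proxy, the number of differently coloured pairs lying on opposite sides of the line. All function names are hypothetical.

```python
from bisect import bisect_right
from itertools import combinations

def candidate_cuts(values):
    """Midpoints between consecutive distinct coordinate values."""
    vs = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(vs, vs[1:])]

def consistent(points, colors, xcuts, ycuts):
    """True if every grid cell contains points of a single color."""
    seen = {}
    for (x, y), c in zip(points, colors):
        cell = (bisect_right(xcuts, x), bisect_right(ycuts, y))
        if seen.setdefault(cell, c) != c:
            return False
    return True

def discerned_pairs(points, colors, axis, cut):
    """Proxy score: differently coloured pairs on opposite sides of the line."""
    return sum(1 for i, j in combinations(range(len(points)), 2)
               if colors[i] != colors[j]
               and (points[i][axis] - cut) * (points[j][axis] - cut) < 0)

def reduce_lines(points, colors):
    """Greedily drop removable lines; returns [vertical cuts, horizontal cuts]."""
    cuts = [candidate_cuts([p[0] for p in points]),
            candidate_cuts([p[1] for p in points])]
    while True:
        removable = []
        for axis in (0, 1):
            for c in cuts[axis]:
                trial = [list(cuts[0]), list(cuts[1])]
                trial[axis].remove(c)
                if consistent(points, colors, trial[0], trial[1]):
                    score = discerned_pairs(points, colors, axis, c)
                    removable.append((score, axis, c))
        if not removable:          # the set of lines is no longer reducible
            return cuts
        _, axis, c = min(removable)
        cuts[axis].remove(c)

print(reduce_lines([(1, 1), (2, 1), (3, 1), (4, 1)], [0, 0, 1, 1]))
print(reduce_lines([(0, 0), (1, 1), (0, 1), (1, 0)], [0, 0, 1, 1]))
```

The result is an irreducible consistent set of lines; like the heuristic in the text, it is not guaranteed to minimize the number of regions.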

6 Conclusions 

We have shown that the problems Optimal Splitting and Optimal Discretization
are NP-complete. We have also proposed an approximation algorithm for the
optimization version of Optimal Splitting. The algorithm can be treated as a
new discretization method, and it can be used in the preprocessing step for other
classification methods.
Abstract. We study the relationship between the reduct problem in rough
set theory and the problem of real value attribute discretization. We
consider the problem of searching for a minimal set of cuts on attribute 
domains that preserves discernibility of objects with respect to any cho- 
sen attributes subset of cardinality s (where s is a parameter given by 
a user). Such a discretization procedure assures that one can keep all 
reducts consisting of at least s attributes. We show that this optimiza- 
tion problem is NP-hard and it is interesting to find efficient heuristics 
for solving this problem. 



1 Introduction 

Discretization of real value attributes is an important task in data mining,
particularly for the classification problem. Empirical results show that the
quality of classification methods depends on the discretization algorithm used
in the preprocessing step. In general, discretization is a process of searching for
a partition of attribute domains into intervals and unifying the values over each
interval. Hence the discretization problem can be defined as a problem of searching
for a suitable set of cuts (i.e. boundary points of intervals) on attribute domains.

Usually we are looking for a consistent set of cuts, i.e. one preserving the
discernibility relation between objects from different decision classes [10,8]. In previous
papers we considered the problem of searching for an optimal consistent set of cuts,
where the optimization criterion was defined by the number of cuts (OD-problem). It has
been shown that this optimization problem is NP-hard [8]. Any discretization
algorithm reduces the information in decision tables. Our heuristic for the
OD-problem, called the MD-algorithm, produces a new decision table with one reduct
only. In some applications based on rough sets (e.g. dynamic reduct and dynamic
rule methods [1]), where reducts are an important tool, it is not enough
to obtain the strong rules.

Hence, in some sense we would like to obtain a larger set of cuts
producing a new discretized decision table containing more reducts and, at the
same time, reducing the superfluous information. In this paper we consider the
problem of searching for a minimal set of cuts that preserves the discernibility
between objects with respect to any subset of $s$ attributes. One can show that
this problem, called the $s$-optimal discretization problem ($s$-OD problem), is also
NP-hard, and that the heuristic algorithm is more complicated than in the case of the
OD-problem. As in the case of the OD-problem, we propose a method based on
Boolean reasoning to solve the $s$-OD problem.

2 Preliminaries 

An information system [11] is a pair $\mathbb{A} = (U, A)$, where $U$ is a non-empty,
finite set called the universe and $A$ is a non-empty, finite set of attributes, i.e.
$a : U \to V_a$ for $a \in A$, where $V_a$ is called the value set of $a$. Elements of $U$ are
called objects. Any non-empty set $B \subseteq A$ defines a $B$-information function by
$Inf_B(x) = \{(a, a(x)) : a \in B\}$ for $x \in U$.

For any subset of attributes $B \subseteq A$, an equivalence relation called the
$B$-indiscernibility relation [11], denoted by $IND(B)$, is defined by

$IND(B) = \{(x, y) \in U \times U : \forall_{a \in B}\, (a(x) = a(y))\}$

Objects $x, y$ satisfying relation $IND(B)$ are indiscernible by attributes from
$B$. By $[x]_{IND(B)}$ we denote the equivalence class of $IND(B)$ defined by $x$. A
minimal (in the sense of inclusion) subset $B$ of $A$ such that $IND(A) = IND(B)$ is
called a reduct of $\mathbb{A}$.
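As a small illustration of these notions, the following Python sketch computes the classes of $IND(B)$ and finds all reducts by brute force. The table, the attribute names and the function names are hypothetical, and the exhaustive search is only meant for tiny examples.

```python
from itertools import combinations

def ind_classes(table, attrs):
    """Equivalence classes of IND(B) as sorted lists of object indices.

    table is a list of dicts mapping attribute names to values.
    """
    classes = {}
    for i, row in enumerate(table):
        classes.setdefault(tuple(row[a] for a in attrs), []).append(i)
    return sorted(classes.values())

def reducts(table, attrs):
    """All minimal subsets B of attrs with IND(B) = IND(A), by exhaustive search."""
    full = ind_classes(table, attrs)
    found = []
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            if any(set(r) <= set(subset) for r in found):
                continue  # contains a smaller reduct, hence not minimal
            if ind_classes(table, subset) == full:
                found.append(subset)
    return found

# Hypothetical information system with three objects and three attributes.
table = [
    {"a": 1, "b": 0, "c": 0},
    {"a": 1, "b": 1, "c": 1},
    {"a": 2, "b": 0, "c": 0},
]
print(ind_classes(table, ["a"]))
print(reducts(table, ["a", "b", "c"]))
```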

Any information system of the form $\mathbb{A} = (U, A \cup \{d\})$ is called a decision table
where $d \notin A$ is called the decision and the elements of $A$ are called conditions.
Let $V_d = \{1, \ldots, r(d)\}$. The decision $d$ determines the partition $\{C_1, \ldots, C_{r(d)}\}$
of the universe $U$, where $C_k = \{x : d(x) = k\}$ for $1 \le k \le r(d)$. The set $C_k$
is called the $k$-th decision class of $\mathbb{A}$.

For any subset of attributes $B \subseteq A$ we define an equivalence relation, called
the relative $B$-indiscernibility relation and denoted by $IND(B, d)$, as follows

$IND(B, d) = \{(x, y) \in U \times U : (d(x) = d(y)) \vee (Inf_B(x) = Inf_B(y))\}$

A minimal (in the sense of inclusion) subset $B$ of attributes from $A$ such that
$IND(B, d) = IND(A, d)$ is called a relative reduct of $\mathbb{A}$.

Let $\mathbb{A} = (U, A \cup \{d\})$ be a decision table and $B \subseteq A$. We define a function
$\partial_B : U \to 2^{\{1, \ldots, r(d)\}}$, called the generalized decision in $\mathbb{A}$, by
$\partial_B(x) = d([x]_{IND(B)})$. A decision table $\mathbb{A}$ is called consistent (deterministic)
if $card(\partial_A(x)) = 1$ for any $x \in U$. Otherwise $\mathbb{A}$ is inconsistent (non-deterministic).
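The generalized decision is easy to compute directly from its definition. The following Python sketch uses a hypothetical three-object table and hypothetical function names; it returns, for each object, the set of decisions occurring in its indiscernibility class, and declares the table consistent when every such set is a singleton.

```python
def generalized_decision(table, decisions, attrs):
    """For each object x, the set d([x]_IND(B)) of decisions met in its class."""
    by_class = {}
    for row, d in zip(table, decisions):
        by_class.setdefault(tuple(row[a] for a in attrs), set()).add(d)
    return [frozenset(by_class[tuple(row[a] for a in attrs)]) for row in table]

def is_consistent_table(table, decisions, attrs):
    """True if the generalized decision of every object is a singleton."""
    return all(len(ds) == 1 for ds in generalized_decision(table, decisions, attrs))

# Hypothetical table: objects 1 and 2 are indiscernible on attributes a and b.
table = [{"a": 1, "b": 0}, {"a": 1, "b": 1}, {"a": 1, "b": 1}]
print(generalized_decision(table, [0, 1, 0], ["a", "b"]))
print(is_consistent_table(table, [0, 1, 0], ["a", "b"]))
print(is_consistent_table(table, [0, 1, 1], ["a", "b"]))
```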

3 Optimal Reduct-Discretization Problems 



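Let $\mathbb{A} = (U, A \cup \{d\})$ be a decision table where $U = \{x_1, \ldots, x_n\}$,
$A = \{a_1, \ldots, a_k\}$ with $a_m : U \to V_{a_m}$, and $d : U \to \{1, \ldots, r(d)\}$. Any pair $(a, c)$
where $a \in A$ and $c \in \mathbb{R}$ will be called a cut on $V_a$. For $a \in A$, any set of cuts
$\{(a, c_1^a), (a, c_2^a), \ldots, (a, c_{k_a}^a)\}$ on $V_a = [l_a, r_a) \subset \mathbb{R}$ defines a partition $P_a$ of $V_a$
into subintervals

$P_a = \{[c_0^a, c_1^a), [c_1^a, c_2^a), \ldots, [c_{k_a}^a, c_{k_a+1}^a)\}$,

where $l_a = c_0^a < c_1^a < c_2^a < \ldots < c_{k_a}^a < c_{k_a+1}^a = r_a$ and
$V_a = [c_0^a, c_1^a) \cup [c_1^a, c_2^a) \cup \ldots \cup [c_{k_a}^a, c_{k_a+1}^a)$. Therefore, any set
of cuts $P = \bigcup_{a \in A} P_a$ defines a new decision table $\mathbb{A}^P = (U, A^P \cup \{d\})$, where
$A^P = \{a^P : a^P(x) = i \Leftrightarrow a(x) \in [c_i^a, c_{i+1}^a)$ for $x \in U$ and $i \in \{0, \ldots, k_a\}\}$.

Two sets of cuts $P', P$ are equivalent, i.e. $P' \equiv_{\mathbb{A}} P$, iff $\mathbb{A}^P = \mathbb{A}^{P'}$. The
equivalence relation $\equiv_{\mathbb{A}}$ has a finite number of equivalence classes. In the sequel
we will not discern equivalent sets of cuts. The set of cuts $P$ is $\mathbb{A}$-consistent
if $\partial_{\mathbb{A}} = \partial_{\mathbb{A}^P}$, where $\partial_{\mathbb{A}}$ and $\partial_{\mathbb{A}^P}$ are generalized decisions of $\mathbb{A}$ and $\mathbb{A}^P$,
respectively. The $\mathbb{A}$-consistent set of cuts $P^{irr}$ is $\mathbb{A}$-irreducible if $P$ is not
$\mathbb{A}$-consistent for any $P \subset P^{irr}$. The $\mathbb{A}$-consistent set of cuts $P^{opt}$ is $\mathbb{A}$-optimal if
$card(P^{opt}) \le card(P)$ for any $\mathbb{A}$-consistent set of cuts $P$.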

Definition 1. The set $P$ of cuts is called $s$-consistent with $\mathbb{A}$ (or $s$-consistent
in short), where $1 \le s \le card(A)$, iff for any decision subtable $\mathbb{B} = (U, B \cup \{d\})$
with $s$ conditional attributes, the set of cuts $P \cap (B \times \mathbb{R})$ is $\mathbb{B}$-consistent.

The 1-consistent set of cuts will be called locally consistent and the
$k$-consistent set of cuts (where $k = card(A)$) will be simply called consistent.

Definition 2. The $s$-consistent set of cuts $P$ is called $s$-irreducible iff $Q$ is not
$s$-consistent for any proper subset $Q \subset P$.

Definition 3. The $s$-consistent set of cuts $P$ is called $s$-optimal iff $card(P) \le card(Q)$
for any $s$-consistent set of cuts $Q$.

When $s = k = card(A)$, instead of saying that the set of cuts is $k$-irreducible
or $k$-optimal, we say that it is irreducible or optimal. We show that any
$s$-consistent set of cuts preserves all relative reducts with cardinality not smaller than
$s$ in the following sense:
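Definition 1 can be checked mechanically on small tables. The Python sketch below is an illustration under stated assumptions: attributes are referred to by column index, cuts are given as one sorted list per attribute, and consistency of a subtable is tested by comparing generalized decisions before and after discretization. The names `gen_decision` and `is_s_consistent` and the three-object table are hypothetical.

```python
from bisect import bisect_right
from itertools import combinations

def gen_decision(rows, decisions, attrs, cuts=None):
    """Generalized decision over the attribute indices attrs.

    If cuts is given (one sorted list of cut points per attribute), each value
    is first replaced by the index of its interval.
    """
    def key(row):
        return tuple(row[a] if cuts is None else bisect_right(cuts[a], row[a])
                     for a in attrs)
    by_class = {}
    for row, d in zip(rows, decisions):
        by_class.setdefault(key(row), set()).add(d)
    return [frozenset(by_class[key(row)]) for row in rows]

def is_s_consistent(rows, decisions, cuts, s):
    """True if the cuts are consistent on every subtable with s attributes."""
    k = len(rows[0])
    return all(gen_decision(rows, decisions, B) == gen_decision(rows, decisions, B, cuts)
               for B in combinations(range(k), s))

# Hypothetical table with two real attributes.
rows = [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0)]
decisions = [0, 1, 0]

print(is_s_consistent(rows, decisions, [[], [1.5]], 2))          # consistent
print(is_s_consistent(rows, decisions, [[], [1.5]], 1))          # not locally consistent
print(is_s_consistent(rows, decisions, [[1.5, 2.5], [1.5]], 1))  # locally consistent
```

In this example one cut suffices for consistency ($s = 2$), while local consistency ($s = 1$) requires cuts on both attributes.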

Proposition 1. If $P$ is an $s$-consistent set of cuts in the decision table
$\mathbb{A} = (U, A \cup \{d\})$ then for any relative (super-)reduct $B$ of $\mathbb{A}$ such that $|B| \ge s$ the
set of discretized attributes $B^P$ is also a relative (super-)reduct in the discretized
decision table $\mathbb{A}^P$.

We illustrate our concepts by Table 1. One can see that the set of all relative
reducts of Table 1 is equal to $R = \{\{a_1, a_2\}, \{a_2, a_3\}\}$. The set of all possible
cuts is equal to $C_A = C_{a_1} \cup C_{a_2} \cup C_{a_3}$ where

$C_{a_1} = \{(a_1, 1.5), (a_1, 2.5), (a_1, 3.5), (a_1, 4.5), (a_1, 5.5), (a_1, 6.5), (a_1, 7.5)\}$

$C_{a_2} = \{(a_2, 1.5), (a_2, 3.5), (a_2, 5.5), (a_2, 6.5), (a_2, 7.5)\}$

$C_{a_3} = \{(a_3, 2.0), (a_3, 4.0), (a_3, 5.5), (a_3, 7.0)\}$

The $s$-optimal sets of cuts for $s = 1, 2, 3$ are the following

$P_1 = \{(a_1, 1.5), (a_1, 2.5), (a_1, 3.5), (a_1, 4.5), (a_1, 5.5), (a_1, 6.5), (a_1, 7.5)\}$
$\cup \{(a_2, 1.5), (a_2, 3.5), (a_2, 5.5), (a_2, 6.5)\} \cup \{(a_3, 2.0), (a_3, 4.0), (a_3, 7.0)\}$

$P_2 = \{(a_1, 3.5), (a_1, 4.5), (a_1, 5.5), (a_1, 6.5)\} \cup \{(a_2, 3.5), (a_2, 6.5)\}$
$\cup \{(a_3, 2.0), (a_3, 4.0)\}$

$P_3 = \{(a_1, 3.5)\} \cup \{(a_2, 3.5), (a_2, 6.5)\} \cup \{(a_3, 4.0)\}$
Table 1. The exemplary decision table with ten objects, three attributes and three
decision classes. The discretized table is presented on the right hand side.

One can see that in the table $\mathbb{A}^{P_2}$ we still have both reducts,
but in the table $\mathbb{A}^{P_3}$ we have only one reduct.



4 Boolean reasoning for discretization problems 

Consider a decision table $\mathbb{A} = (U, A \cup \{d\})$ where $U = \{u_1, u_2, \ldots, u_n\}$ and
$A = \{a_1, \ldots, a_k\}$. By $C_{a_m} = \{(a_m, c_1^m), \ldots, (a_m, c_{n_m}^m)\}$ denote the set of all
possible cuts on the attribute $a_m$ for $m = 1, \ldots, k$ and by $P_{a_m} = \{p_1^{a_m}, \ldots, p_{n_m}^{a_m}\}$
we denote the set of Boolean variables corresponding to them.

For any objects $u_i, u_j \in U$ such that $d(u_i) \ne d(u_j)$ and for any attribute
$a_m \in A$ we define the set of cuts which discern $u_i$ and $u_j$ as follows:

$C_{a_m}^{i,j} = \{(a_m, c) \in C_{a_m} : (a_m(u_i) - c)(a_m(u_j) - c) < 0\}$.

By $C_B^{i,j}$, where $B \subseteq A$, we denote the set of cuts on attributes from $B$ which
discern $u_i$ and $u_j$, i.e. $C_B^{i,j} = \bigcup_{a \in B} C_a^{i,j}$.

The Boolean function $\psi_B^{i,j}$, called the discernibility function of objects $u_i$ and $u_j$
over $B$, is defined by $\psi_B^{i,j} = \bigvee P_B^{i,j}$, where $P_B^{i,j}$ is the set of Boolean variables
corresponding to the cuts $C_B^{i,j}$. We can say that the set of cuts $P$ satisfies the
function $\psi_B^{i,j}$ iff $P \cap C_B^{i,j} \ne \emptyset$.

The discernibility Boolean function for the set of attributes $B$ is defined by:

$\Phi_B = \bigwedge_{d(u_i) \ne d(u_j)} \psi_B^{i,j}$.

We say that the set of cuts $P$ satisfies the function $\Phi_B$ iff $P$ satisfies the functions
$\psi_B^{i,j}$ for all $i, j$ such that $d(u_i) \ne d(u_j)$. In previous papers we have considered
the discernibility Boolean function for the table $\mathbb{A}$ defined by $\Phi_{\mathbb{A}} = \Phi_A$, and it
has been shown that any set of cuts is consistent if it satisfies the function $\Phi_{\mathbb{A}}$.
We have also shown the following: 
Theorem 1. The set of cuts $P$ is optimal iff the set of Boolean variables
corresponding to $P$ is a prime implicant of the function $\Phi_{\mathbb{A}}$.

This fact allows us to construct an efficient algorithm searching for a
semi-optimal set of cuts by applying the approximation algorithm (greedy algorithm)
for computing the minimal prime implicant [14] of the function $\Phi_{\mathbb{A}}$.

5 Properties and generalizations 

In this section we explore some interesting properties of s-consistent sets of cuts. 
Firstly, we discuss the computational complexity of the optimization problem 
related to s-optimal sets of cuts. 

Theorem 2. For a given decision table $\mathbb{A} = (U, A \cup \{d\})$ and an integer $s$, the
problem of searching for an $s$-optimal set of cuts is

1. in $DTIME(kn \log n)$ for $s = 1$;

2. NP-hard for any $s \ge 2$.

Proof. Fact 1 is obvious. To prove Fact 2 we recall the following theorem
presented in [5]: The problem of searching for an optimal set of cuts in decision
tables with two condition attributes is NP-hard. □

Let us generalize $\Phi_B$ introduced in the previous section and define a Boolean
function $\Phi_s$ such that any $s$-optimal set of cuts corresponds to a minimal implicant
of $\Phi_s$, i.e.

$\Phi_s = \bigwedge_{|B| = s} \Phi_B$

One can see that the number of clauses in the Boolean function $\Phi_s$ is equal
to $\binom{k}{s}$ times $conflict(\mathbb{A})$, where $conflict(\mathbb{A}) = card\{(u_i, u_j) : d(u_i) \ne d(u_j)\} = O(n^2)$
is the number of pairs of objects with different decision values. Thus, any greedy
algorithm searching for a minimal reduct of the function $\Phi_s$ needs $O\left(n^2 \cdot \binom{k}{s}\right)$ steps
to compute the number of clauses satisfied by any given cut. The function $\Phi_s$
can be rewritten as follows:

$\Phi_s = \bigwedge_{d(u_i) \ne d(u_j)} \; \bigwedge_{|B| = s} \psi_B^{i,j}$
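To get a feeling for how quickly this clause count grows, it can be evaluated directly. The short Python sketch below assumes that $conflict(\mathbb{A})$ counts unordered pairs of objects; the function names and the sample decision vector are hypothetical.

```python
from itertools import combinations
from math import comb

def conflict(decisions):
    """Number of unordered pairs of objects with different decision values."""
    return sum(1 for d1, d2 in combinations(decisions, 2) if d1 != d2)

def clause_count(decisions, k, s):
    """Number of clauses: one per s-subset of the k attributes and conflicting pair."""
    return comb(k, s) * conflict(decisions)

decisions = [1, 1, 2, 3]
print(conflict(decisions))
print(clause_count(decisions, k=3, s=2))
```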



For any pair of objects $u_i, u_j$ such that $d(u_i) \ne d(u_j)$ we denote the set of
attributes which discern $u_i$ and $u_j$ by $A_{ij} = \{a \in A : a(u_i) \ne a(u_j)\}$. We have

Proposition 2. If $|A_{ij}| \le k - s + 1$ then $\bigwedge_{|B| = s} \psi_B^{i,j} = \bigwedge_{a \in A_{ij}} \psi_a^{i,j}$.

In consequence we have:

Theorem 3. Let $k_{ij} = \min\{|A_{ij}|, k - s + 1\}$. The set of cuts $P$ satisfies the
function $\Phi_s$ iff for any pair of objects $u_i, u_j$ such that $d(u_i) \ne d(u_j)$, there are
$k_{ij}$ attributes $a_{m_1}, \ldots, a_{m_{k_{ij}}}$ such that $P$ satisfies all functions $\psi_{a_{m_1}}^{i,j}, \ldots, \psi_{a_{m_{k_{ij}}}}^{i,j}$,
i.e. $P \cap C_{a_{m_l}}^{i,j} \ne \emptyset$ for $l = 1, \ldots, k_{ij}$.
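The condition of Theorem 3 avoids enumerating all $s$-subsets of attributes. A minimal Python sketch of such a test follows, assuming attributes are addressed by column index and cuts are given as one list per attribute. The function name and the sample table are hypothetical. A cut $c$ is taken to discern two values when it separates them under the half-open interval convention.

```python
from itertools import combinations

def satisfies_phi_s(rows, decisions, cuts, s):
    """Check the Theorem 3 condition for a set of cuts.

    Every pair with different decisions must be discerned by the cuts on at
    least k_ij = min(|A_ij|, k - s + 1) different attributes.
    """
    k = len(rows[0])
    for i, j in combinations(range(len(rows)), 2):
        if decisions[i] == decisions[j]:
            continue
        a_ij = [a for a in range(k) if rows[i][a] != rows[j][a]]
        k_ij = min(len(a_ij), k - s + 1)
        hit = sum(1 for a in a_ij
                  if any(min(rows[i][a], rows[j][a]) < c <= max(rows[i][a], rows[j][a])
                         for c in cuts[a]))
        if hit < k_ij:
            return False
    return True

rows = [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0)]
decisions = [0, 1, 0]
print(satisfies_phi_s(rows, decisions, [[], [1.5]], 2))
print(satisfies_phi_s(rows, decisions, [[], [1.5]], 1))
print(satisfies_phi_s(rows, decisions, [[1.5, 2.5], [1.5]], 1))
```

The test runs over pairs of objects and attributes only, which is the source of the efficiency gain mentioned in the text.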

The last theorem implies that one can construct a more efficient discretization
algorithm than the greedy heuristics. It will be presented in the next section.
Now we conclude with the fact that $s$-consistency is monotone with respect to $s$.
In particular, it implies that one can reduce the $s$-optimal set of cuts to obtain
an $(s + 1)$-optimal set of cuts.

Proposition 3. For any decision table $\mathbb{A} = (U, A \cup \{d\})$, $card(A) = k$, and for
any integer $s \in \{1, \ldots, k - 1\}$, if the set of cuts $P$ is $s$-consistent with $\mathbb{A}$, then $P$
is also $(s + 1)$-consistent with $\mathbb{A}$.



6 The Algorithm Framework 



In previous papers [8,9] we presented a discretization algorithm called the
MD-heuristic¹ for the total optimal ($k$-optimal) discretization problem. This is a
version of the greedy algorithm applied to the Boolean function $\Phi_k$. The main
idea of this method is based on the construction and analysis of a new table
$\mathbb{A}^* = (U^*, A^*)$ where

• $U^* = \{(u_i, u_j) \in U^2 : d(u_i) \ne d(u_j)\}$

• $A^* = \{c : c$ is a cut on $\mathbb{A}\}$, where $c((u_i, u_j)) = 1$ if $c$ discerns $u_i, u_j$, and $0$ otherwise.

This table consists of $O(nk)$ attributes (cuts) and $O(n^2)$ objects (see Table
2). We denote by $Disc(a, c)$ the discernibility degree of the cut $(a, c)$, which is
defined as the number of pairs of objects from different decision classes (or number
of objects in table $\mathbb{A}^*$) discerned by $c$. The MD-heuristic searches for a cut
$(a, c) \in A^*$ with the largest discernibility degree $Disc(a, c)$. Then we move the
cut $c$ from $A^*$ to the result set of cuts $P$ and remove from $U^*$ all pairs of objects
discerned by $c$. Our algorithm is continued until $U^* = \emptyset$. It has been shown in [9]
that the MD-heuristic is very efficient, because it determines the best cut in $O(kn)$
steps using $O(kn)$ space only.
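The greedy loop described above can be sketched in a few lines of Python. This is a naive illustration that recomputes discernibility degrees from scratch at every step, so it does not achieve the $O(kn)$ bound reported in [9]. The function names, the use of midpoints as candidate cuts and the sample table are assumptions made for the example.

```python
from itertools import combinations

def candidate_cuts(rows):
    """All cuts (attribute index, midpoint) between consecutive distinct values."""
    cuts = []
    for a in range(len(rows[0])):
        vs = sorted({row[a] for row in rows})
        cuts += [(a, (lo + hi) / 2) for lo, hi in zip(vs, vs[1:])]
    return cuts

def md_heuristic(rows, decisions):
    """Greedy choice of cuts by maximal discernibility degree (naive version)."""
    pairs = {(i, j) for i, j in combinations(range(len(rows)), 2)
             if decisions[i] != decisions[j]}          # the objects of U*
    cuts = candidate_cuts(rows)                        # the attributes of A*
    result = []

    def discerned(cut):
        a, c = cut
        return {(i, j) for (i, j) in pairs
                if min(rows[i][a], rows[j][a]) < c < max(rows[i][a], rows[j][a])}

    while pairs:
        best = max(cuts, key=lambda cut: len(discerned(cut)), default=None)
        hit = discerned(best) if best is not None else set()
        if not hit:
            raise ValueError("inconsistent table: some pairs cannot be discerned")
        result.append(best)
        cuts.remove(best)
        pairs -= hit                                   # remove discerned pairs from U*
    return result

rows = [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0)]
decisions = [0, 1, 0]
print(md_heuristic(rows, decisions))
```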

One can modify this algorithm for the $s$-optimal discretization problem by
applying Theorem 3. At the beginning, we assign the required cut number $k_{ij}$ and
the set of discerning attributes $L_{ij} := \emptyset$ to every pair of objects $(u_i, u_j) \in U^*$ (see
Theorem 3). Next we look for a cut $(a, c) \in A^*$ with the largest discernibility
degree $Disc(a, c)$ and move $(a, c)$ from $A^*$ to the result set of cuts $P$. Then we
insert the attribute $a$ into the lists of attributes of all pairs of objects discerned by
$(a, c)$. We also delete from $U^*$ such pairs $(u_i, u_j)$ that $|L_{ij}| = k_{ij}$. This algorithm
is continued until $U^* = \emptyset$.
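A minimal Python sketch of this modification follows, again as a naive illustration with hypothetical names. It departs from the text in one respect that should be kept in mind: a cut's score counts only those pairs for which the cut contributes a new discerning attribute, rather than all pairs of $U^*$ it discerns. Candidate cuts are midpoints between consecutive attribute values.

```python
from itertools import combinations

def s_md_heuristic(rows, decisions, s):
    """Greedy search for an s-consistent set of cuts (naive sketch)."""
    k = len(rows[0])
    need, found = {}, {}          # k_ij and L_ij for every pair still in U*
    for i, j in combinations(range(len(rows)), 2):
        if decisions[i] != decisions[j]:
            size = sum(1 for a in range(k) if rows[i][a] != rows[j][a])
            if size:              # pairs with |A_ij| = 0 cannot be discerned at all
                need[(i, j)] = min(size, k - s + 1)
                found[(i, j)] = set()
    cuts = []
    for a in range(k):
        vs = sorted({row[a] for row in rows})
        cuts += [(a, (lo + hi) / 2) for lo, hi in zip(vs, vs[1:])]
    result = []

    def gain(cut):
        a, c = cut
        return [(i, j) for (i, j) in need
                if a not in found[(i, j)]
                and min(rows[i][a], rows[j][a]) < c < max(rows[i][a], rows[j][a])]

    while need:
        best = max(cuts, key=lambda cut: len(gain(cut)))
        hit = gain(best)
        result.append(best)
        cuts.remove(best)
        for p in hit:
            found[p].add(best[0])          # insert attribute a into L_ij
            if len(found[p]) == need[p]:   # |L_ij| = k_ij: drop the pair from U*
                del need[p]
    return result

rows = [(1.0, 1.0), (2.0, 2.0), (3.0, 1.0)]
decisions = [0, 1, 0]
print(s_md_heuristic(rows, decisions, 2))
print(s_md_heuristic(rows, decisions, 1))
```

For $s = k$ every $k_{ij}$ equals 1, and the sketch behaves like the plain MD-heuristic.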

¹ Abbreviation of "Maximal Discernibility Heuristic".
Table 2. A fragment of the temporary table constructed from A (see Table 1)



In Figure 1 we show all possible cuts for the decision table $\mathbb{A}$ (presented in Table
1). Table $\mathbb{A}^*$ (see Table 2) consists of 33 pairs of objects from different decision
classes. For $s = 2$, the required numbers of cuts $k_{ij}$ for all $(u_i, u_j) \in U^*$ (see
Theorem 3) are equal to 2, except for one pair for which $k_{ij} = 1$. Our algorithm begins
by choosing the best cut $(a_3, 4.0)$, discerning 20 pairs of objects from $\mathbb{A}^*$. In the next
step the cut $(a_1, 3.5)$ will be chosen because 17 pairs of objects are discerned
by this cut. After this step one can remove 9 pairs of objects from $U^*$
because they are discerned by two cuts on
two different attributes. When the algorithm stops one can eliminate some
superfluous cuts to obtain the set of cuts $P_2$ presented in Section 3.
Fig. 1. Illustration of cuts on the table A. Objects are marked by three labels with
respect to their decision values.





552 H.S. Nguyen 



7 Conclusions 

We presented a discretization method that preserves the discernibility between
objects and the relative reducts of cardinality $\ge s$, for a parameter $s$ given by
a user. We proposed a method for solving this problem based on the Boolean reasoning
approach. An initial approximation algorithm framework was also presented.
As a continuation we plan to adapt the very efficient MD-heuristic presented in [9].

Acknowledgment: The work was supported by Polish State Committee for Scientific
Research grant No. 8T11C01011 and Research Program of European Union
- ESPRIT-CRIT2 No. 20288.
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Abstract. A rough sets approach to the optimisation of product sched- 
ules is proposed. This approach provides a general method for finding 
approximate solutions to a certain class of NP-hard mathematical
programming problems that arise in product scheduling in manufacturing.



1 Introduction 

By industrial design we mean the development of intelligent solutions for
industrial and business problems. Rough mereology offers a new paradigm for
applications in industrial design because our solutions are being expressed in 
approximate terms, that is “in acceptable degree”, unlike previous approaches 
in which an optimal solution was expected. An industrial problem which shows 
most immediate benefit from the rough mereology approach is that of produc- 
tion scheduling in manufacturing. This problem needs to be re-examined for the 
possibility of better solutions that the new paradigm makes possible. 

In this paper I will discuss the relevance of the rough mereology approach to 
production scheduling in manufacturing. In Section 2 the production scheduling 
problem is described. In Section 3 the rough mereology is reviewed. In Section 
4 we demonstrate that the production scheduling problem can be formulated 
within a rough mereology approach. Conclusions are presented in Section 5. 



2 The Production Scheduling Problem 

Products are scheduled for production on several different production lines (pro- 
cessors). Product demands over a certain number of time periods (called the 
planning horizon) are known. Each processor is capable of producing more than 
one product, but only one at a time. A production schedule is a week by week 
plan showing which products are produced on which processors in which time 
periods. 

The event of switching a processor from producing one product to producing
another is known as a changeover. Its cost is called changeover cost. The cost
of holding a product in the warehouse after it is produced is called holding cost.
Total production cost is the sum of manufacturing, holding and changeover costs.

A feasible schedule satisfies demands. An optimal schedule minimises total 
production cost while satisfying demands. Our objective is to produce a close to 
optimal schedule. 
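
The cost definitions above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a schedule maps (processor, period) to a (product, quantity) pair, that manufacturing cost is charged per unit, that holding cost is charged per unit left in stock at the end of each period, and that a fixed changeover cost is charged whenever a processor switches product. None of these data layouts or names come from the paper.

```python
from collections import defaultdict

def total_cost(schedule, demand, periods, unit_cost, hold_cost, changeover_cost):
    """Return (feasible, cost) for a schedule.

    schedule: {(processor, period): (product, quantity)}  (hypothetical layout)
    demand:   {(product, period): quantity}
    """
    stock = defaultdict(float)      # inventory carried per product
    last_product = {}               # last product run on each processor
    cost = 0.0
    feasible = True
    for t in range(periods):
        for (proc, period), (prod, qty) in sorted(schedule.items()):
            if period != t:
                continue
            cost += unit_cost[prod] * qty                  # manufacturing cost
            if proc in last_product and last_product[proc] != prod:
                cost += changeover_cost                    # changeover on this processor
            last_product[proc] = prod
            stock[prod] += qty
        for (prod, period), qty in demand.items():
            if period == t:
                stock[prod] -= qty
                if stock[prod] < 0:                        # demand not met in time
                    feasible = False
        # holding cost on whatever remains in the warehouse at period end
        cost += sum(hold_cost[p] * max(s, 0.0) for p, s in stock.items())
    return feasible, cost

# Illustrative data: one processor, two products, two periods.
schedule = {("P1", 0): ("widget", 10), ("P1", 1): ("gadget", 5)}
demand = {("widget", 0): 6, ("widget", 1): 4, ("gadget", 1): 5}
feasible, cost = total_cost(
    schedule, demand, periods=2,
    unit_cost={"widget": 2.0, "gadget": 3.0},
    hold_cost={"widget": 0.5, "gadget": 0.5},
    changeover_cost=7.0,
)
```

In this example the schedule is feasible: 20 for widgets, 2 for holding four widgets over one period, 15 for gadgets and 7 for the changeover.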

L. Polkowski and A. Skowron (Eds.): RSCTC’98, LNAI 1424, pp. 553-556, 1998. 
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3 What is Rough Mereology? 

Rough mereology [3, 4] is an approach to problems of approximate reasoning
based on rough set theory extended so that the problems can be allocated to a
system of agents. It comprises a process of learning about how objects are
constructed and a process of classifying objects based on the knowledge acquired in
the learning process. The basic idea is embodied in the meaning of the statement
$\mu(x, y) = \varepsilon$, which is read: x is a part of y in degree at least $\varepsilon$. But what metric
do we use to make precise the notion of x being a part of y to a degree? One
answer would be to look at the similarity between the two objects. Information
about objects is expressed by attributes applicable to objects of a given class
and values for those attributes for each object of that class. An approximate
measure for $\mu(x, y)$ is the proportion of attribute values that the two objects x
and y share in common.
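
The approximate measure described above can be sketched in a few lines. This is only an illustration under stated assumptions: objects are plain dictionaries of attribute values, the proportion is taken relative to the attributes of x, and the attribute names are invented for the example.

```python
def part_degree(x, y):
    """Proportion of x's attribute values that y shares (a stand-in for mu(x, y))."""
    if not x:
        raise ValueError("x must have at least one attribute")
    shared = sum(1 for a, v in x.items() if a in y and y[a] == v)
    return shared / len(x)

# Hypothetical attribute sets for two objects of the same class.
man_a = {"body": "skinny", "legs": "skinny", "height": "tall", "hair": "dark"}
man_b = {"body": "fat", "legs": "skinny", "height": "tall", "hair": "fair"}
degree = part_degree(man_a, man_b)   # two of the four values agree
```

An object compared with itself gets degree 1, which matches the intuition that every object is a part of itself in the highest degree.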

We assume a hierarchically structured agent system in which an agent
synthesises objects of a particular class from sub-objects sent to it by its children.
Consider the following example from [3]: We are constructing a man from subparts
of a body (which includes head, arms and trunk) and a pair of legs. There are
two types of bodies, a skinny body B1 and a fat body B3, and two types of legs,
skinny legs L2 and fat legs L3. A consistent set of values for the $\mu$'s is as follows:

$\mu_{body}(B3, B1) \ge 0.25$, $\mu_{legs}(L2, L2) \ge$ [value illegible in the source], $\mu_{man}(B3L2, B1L2) \ge 0.28$,
$\mu_{legs}(L3, L2) \ge 0.4$, $\mu_{man}(B3L3, B1L2) \ge 0.14$.

Men B3L2 and B1L2, who have the same legs, share at least 28% of their attribute
values in common, while men B3L3 and B1L2, who have neither the same legs
nor the same bodies, share only at least 14% of their attribute values in common.

The $\varepsilon$'s for a particular object class are treated as values of a decision
attribute and the respective $\mu$'s as values of respective condition attributes.
Decisions propagate from the lowest level agents to the top level agent during
synthesis of an object. Each non-leaf agent chooses from among the possible
objects that can be constructed from parts sent to it by its children the object
that most closely matches the desirable specification that has been sent to it
from above.



4 Rough Mereology Production Scheduling 

A mapping between the production scheduling domain and the blocks world 
setting to facilitate user acceptance of the decision support system has been 
provided in [1, 2]. Thereafter, we can operate in a two dimensional world inhab- 
ited by a robot that moves blocks around to develop preferable schedules. 

The top level agent constructs a preferable schedule by expanding itself to 
a system of agents as illustrated in Fig. 1. Multiple goals are achieved includ- 
ing those of satisfying customer demands and running the products on a given 
processor in a predefined order. Regarding the latter, it might not be a good
idea to produce widgets before gadgets on a given processor because widgets
are black and gadgets are white. The processor would have to be shut down
for cleaning after producing widgets and before producing gadgets. Otherwise that
batch of gadgets would turn out to be gray instead of white.

attributes of a preferable schedule
* simplicity
* learnability
* comprehensibility
* user satisfaction
* ease of use
* user-friendliness

Fig. 1. Agent Based Scheduling

attributes of a schedule
* number of processors
* number of products
* number of time periods
* which processors can produce which products

attributes of a feasible schedule
* weekly production
* weekly consumption

attributes of a cost effective schedule
* holding cost
* manufacturing cost
* changeover cost

attributes of a schedule with the inclusion of precedence relations
* partial ordering on product numbers

attributes of a schedule with the inclusion of run length restrictions
* maximum run length
* number of idle time periods

Fig. 2. Attributes of Second Level Agents

Attributes of the objects produced by each second level agent are provided in 
Fig. 2. Second level agents expand themselves to agent hierarchies. For example 
the agent that enforces feasibility partitions the schedule into regions and assigns 
each region to an agent. Decomposition formulae are extracted from examples of 
the construction of feasible schedules from feasible regions. A synthesis process 
combines the feasible regions to produce a feasible schedule. 

5 Conclusions 

The rough mereology approach has the following advantages for application to 
the production scheduling problem: 

1. The problem of constructing a mathematical programming (MP) model can 
be formulated as a problem of learning about how objects are constructed 
from sub-objects. 

2. The problem of solving an MP model can be formulated as the problem of
synthesising objects from sub-objects within an agent based framework.

3. An infeasible solution to an MP model corresponds to a schedule that fails 
to satisfy customer demands while keeping the cost of the schedule at an 
acceptable level. 

4. General constraints of the rough mereology force a proper design for parti- 
tioning a problem into subproblems for allocation to a hierarchy of agents. 

5. Flexibility to use different approaches to reasoning with uncertain informa- 
tion is facilitated. For example, the second level agents which provide for 
precedence relations and run length restrictions could be based on methods 
for approximate definability of concepts from example schedules whereas the 
other second level agents would benefit from methods for extracting schemes 
of approximate reasoning from the data. 
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Abstract. The meshless strategy of CAE problem partitioning is presented. 
It offers a significant reduction of time complexity in comparison with the usual
mesh decomposition algorithms, and therefore may be applied in on-line
processing. The effective IE graph solid representation is applied, as well as a
stochastic performance forecast system for parallel MIMD computer nodes. The
main scaling problem is formulated as optimal fuzzy graph matching, and it
is proposed that it be solved by efficient ETPL(k) parsers.



1 Introduction 

A successful utilization of multiprocessor computer installations imposes 
parallelization of most complex problems that appear in the discrete mechanical 
analysis of structures (CAE). A suggested course of parallelization is to decompose 
the whole domain and then process each subdomain concurrently. A particular 
decomposition is as good as the resulting tasks are scaled to the available performance 
distribution and as it decreases communication between processors. Although an 
accurate scaling of CAE problem can be obtained only by a decomposition of 
a computational mesh, a rough decomposition may provide a similar speedup, because 
of an inherent uncertainty of an evaluation of virtual computer performance 
parameters. The new, meshless strategy proposed in this paper takes into account such 
an uncertainty. We make the following assumptions:

• Partitioning is performed on a feature-oriented, unambiguous IE graph (see [4,6]) 
representation of the solution domain before the computational mesh is generated. 
It allows us to minimize an interprocess communication involving maximum 
information about solid topology. 

• We make use of synthetic information about the computational complexity introduced
by each part of the problem’s domain, e.g. the space density of degrees of freedom
(d.o.f), and we take into account the uncertainty of their distribution. In particular, we
know the error distribution for the evaluated computational complexity of each task and
the communicational complexity resulting from assembling partial results obtained on
interfaces. This information is contained in the density of d.o.f function $\rho$. It may be
defined by the designer’s introductory heuristic analysis of a CAE problem, or based on
an earlier solution, using a posteriori error estimation.

• We dedicate our strategy to a distributed environment (a computer network) with
an asynchronous communication medium (e.g. Ethernet LANs).
A stochastic forecast is used to predict the distribution of computer performance over
a network (cf. [8]).



2 IE-graph Solid Representation Preliminaries

We aim to represent all kinds of structures like machine parts, buildings etc. and call
them Engineering Solids - ES. These are compact, bounded and closed subsets of
$\mathbb{R}^3$ having a Lipschitz boundary.

Definition 1. (cf. [4,5]) An Indexed Edge-unambiguous graph (IE graph) over $\Sigma$
and $\Gamma$ is a quintuple $G = (V, E, \Sigma, \Gamma, \Phi)$, where V is a finite, nonempty set of nodes,
to which indices have been ascribed in an unambiguous way on the basis of properties
of a represented object, $\Sigma$ is a finite nonempty set of node labels, $\Gamma$ is a finite
nonempty set of edge labels, E is a set of edges of the form $(v, \lambda, w)$ where $v, w \in V$,
$\lambda \in \Gamma$, such that the index of v is less than the index of w, and $\Phi : V \to \Sigma$ is a node
labeling function. The family of all the IE graphs will be denoted with IEG.
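
The quintuple in Definition 1 can be mirrored by a small data structure. The sketch below is only an illustration: the class name `IEGraph`, its fields and the label strings are assumptions made for the example, and the sets of admissible labels are not enforced. It does enforce the two structural conditions stated in the definition, namely unique node indices and edges that run from a lower index to a higher one.

```python
from dataclasses import dataclass, field

@dataclass
class IEGraph:
    """Hypothetical container for an IE graph (V, E, Sigma, Gamma, Phi)."""
    node_labels: dict = field(default_factory=dict)   # index -> node label (Phi)
    edges: set = field(default_factory=set)           # {(v, edge_label, w)}

    def add_node(self, index, label):
        if index in self.node_labels:
            raise ValueError(f"index {index} already used")
        self.node_labels[index] = label

    def add_edge(self, v, label, w):
        if v not in self.node_labels or w not in self.node_labels:
            raise KeyError("both end nodes must exist")
        if not v < w:
            raise ValueError("edge must run from the lower to the higher index")
        self.edges.add((v, label, w))

g = IEGraph()
g.add_node(1, "BCS")
g.add_node(2, "FEATURE")
g.add_edge(1, "1.2", 2)
```

Storing the node labeling as a dictionary keyed by index makes the labeling function a plain lookup, and the index ordering check keeps each edge in a single canonical direction.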

Definition 2. (Graph inclusion) Let there be two given IE graphs
$A = (V_A, E_A, \Sigma, \Gamma, \Phi_A)$ and $B = (V_B, E_B, \Sigma, \Gamma, \Phi_B)$. When there is an injection $\iota : V_A \to V_B$ such
that $\Phi_B \circ \iota = \Phi_A$ and $\{(\iota(v), \lambda, \iota(w)) : (v, \lambda, w) \in E_A\} \subseteq E_B$, we say that A is a subgraph of
B or that A is included in B and we denote it by $A \subset B$.

We distinguish two separate classes of node labels: Basic Constructive Solids,
BCS, (for any ES, BCS is its convex hull; faces of such a convex hull are indexed
unambiguously (see [5])) and Features: $\Sigma = BCS \cup FEATURE$.

IE graphs are used for representing Engineering Solids - ES according to the following
scheme (see [5]). Any engineering solid is modelled by subtracting from BCS
volumes chosen from a predefined set of (primitive) Features. A subtraction consists in
placing the so-called sweep base (a contour) on some BCS faces and cutting a volume
along some direction (see Fig. 1a). The IE graph representation is defined in such a way
that node labels describe types of BCS-es and types of sweep bases, whereas edge labels define









positioning of a modelled feature (i.e. placing a sweep base on the face and a way of
its sweeping in relation to the BCS). For example, let us look at the IE graph shown in
Fig. 1b. The edge connecting a node indexed with 1 (BCS1) and a node indexed with
2 (corresponding to the "square" through pocket in the solid shown in Fig. 1a) is
labelled with 1.2, because the feature interacts with faces 1 and 2 of BCS (see Fig.
1a). The edge connecting node 1 with node 3 (a square slot) is labelled with 1.5.2,
because the feature "starts" from face 1, "ends" at face 2 and additionally interacts
with face 5 of BCS (the "upper" one). With the scheme we can also represent
adjacency of features. For example, a square slot represented with node 3 is
adjacent with a side (of its sweep base) indexed with 2 to a V-slot (graph node 4), in
particular to its (sweep base’s) side indexed with 0 (cf. Figs. 1a and 1b). Therefore,
the corresponding edge is labelled with a.2.0 ("a" means "adjacent", 2 and 0 are sides
of adjacent sweep bases). This scheme has been defined and discussed in detail
in [5]. In [5] we have also proved that engineering solids treated in such a way
can be represented with the family of IE graphs in a unique and unambiguous way, i.e.
there is a one-to-one mapping between ES and IEG.
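
The three edge labels discussed above (1.2, 1.5.2 and a.2.0) follow a simple dotted pattern, which the sketch below decodes. It is an illustration only: the function name and the keys of the returned dictionary are invented here, and only the label shapes that appear in the text are handled; the full labelling scheme is the one defined in [5].

```python
def parse_edge_label(label):
    """Decode a dotted IE graph edge label into a small dictionary."""
    parts = label.split(".")
    if parts[0] == "a":
        # adjacency of two features: "a.<side of first base>.<side of second base>"
        if len(parts) != 3:
            raise ValueError(f"malformed adjacency label: {label}")
        return {"kind": "adjacency", "sides": (int(parts[1]), int(parts[2]))}
    faces = [int(p) for p in parts]
    if len(faces) < 2:
        raise ValueError(f"malformed positioning label: {label}")
    # positioning of a feature: first face is where it starts, last where it ends,
    # anything in between is an additional BCS face it interacts with
    return {"kind": "positioning", "start": faces[0], "end": faces[-1],
            "via": tuple(faces[1:-1])}

pocket = parse_edge_label("1.2")
slot = parse_edge_label("1.5.2")
adjacent = parse_edge_label("a.2.0")
```

Keeping the decoded form separate from the label string leaves the graph itself unchanged, so the labels can still be compared as plain symbols by a parser.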



3 Decomposition Graphs, Computational Complexity Graphs and
Computer Network Graphs

A decomposition graph is an IE graph derived from a graph representing an
engineering solid S and reflects the geometry and topology of a particular partitioning of
such a solid. The image of a decomposition graph through the representation
mapping is the same as the initial image, but the solid’s body is treated as split into
a given number of parts. Thus, within a decomposition graph we can distinguish
several subgraphs.

Definition 3. A decomposition graph of an engineering solid $S \in ES$ is an IE graph
$G = (V, E, \Sigma, \Gamma, \Phi)$ that consists of at least two nonoverlapping subgraphs
$G_i = (V_i, E_i, \Sigma, \Gamma, \Phi_i)$, $V_i \cap V_j = \emptyset$ for $i \ne j$, $i = 1, \ldots, M$, $M > 1$, such that in each of them we may
distinguish a 'representative' BCS-labeled node v: $\forall i = 1, \ldots, M \; \exists v \in V_i : \Phi_i(v) \in BCS$, and
the only graph edges linking separate subgraphs connect their BCS-labeled nodes,
thus: $\forall v \in V_i, w \in V_j$, $i, j = 1, \ldots, M$, $i \ne j$:

$\exists (v, \lambda, w) \in E \Rightarrow (\Phi_i(v) \in BCS \text{ and } \Phi_j(w) \in BCS)$.

The set of such edges (its symbol is illegible in the source) is:

$\{(v, \lambda, w) \in E : v \in V_i, w \in V_j, \lambda \in \Gamma, i \ne j, \Phi_i(v) \in BCS \text{ and } \Phi_j(w) \in BCS\}$.

A decomposition graph G of an engineering solid S is denoted with d(S). The family of
all the decomposition graphs over an engineering solid S is denoted with

$DG_S = \{G : G = d(S)\}$.

For example, if we decompose the solid shown in Fig. 1a in the way shown in Fig. 2a,
then we obtain the decomposition graph shown in Fig. 2b.
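
The two conditions of Definition 3 can be checked mechanically. The following sketch assumes a simple representation that is not taken from the paper: a dictionary of node labels, a dictionary assigning each node to a subgraph, and a set of (v, label, w) edges. The labels used in the example are invented.

```python
def is_decomposition(node_label, part_of, edges, bcs_labels):
    """Check the two conditions of a decomposition graph on a labelled graph."""
    parts = set(part_of.values())
    if len(parts) < 2:                      # at least two subgraphs are required
        return False
    for p in parts:                         # each subgraph needs a BCS-labelled node
        if not any(node_label[v] in bcs_labels
                   for v in part_of if part_of[v] == p):
            return False
    for v, _lam, w in edges:                # cross edges may only join BCS nodes
        if part_of[v] != part_of[w]:
            if node_label[v] not in bcs_labels or node_label[w] not in bcs_labels:
                return False
    return True

# Hypothetical graph: two halves of a solid, each with one slot feature.
node_label = {1: "BCS1", 2: "BCS2", 3: "slot", 4: "slot"}
part_of = {1: "A", 3: "A", 2: "B", 4: "B"}
edges = {(1, "x", 2), (1, "1.3.2", 3), (2, "1.4.2", 4)}
ok = is_decomposition(node_label, part_of, edges, {"BCS1", "BCS2"})
bad = is_decomposition(node_label, part_of, edges | {(3, "a.2.0", 4)},
                       {"BCS1", "BCS2"})
```

The second call fails because the added edge links two feature nodes that lie in different subgraphs.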

Fig. 2. Solid after decomposition

To be able to forecast the computational complexity of a CAE task we have to
estimate the number of d.o.f falling into particular subparts of a partitioned volume, as
well as the number of d.o.f contained in interface surfaces between subdomains. Their
evaluation will be expressed with the computational complexity graph. In order to
provide its formal description let us define, firstly, a set of random variables
$H_i, I_j : \Omega \to \mathbb{N}$, $i = 1, \ldots, M$, $j = 1, \ldots, m$ (the upper bound on m is illegible in the
source), whose values are the number of
degrees of freedom in the i-th subdomain and the number of degrees of freedom on the
j-th interface, where j is an index that enumerates interfaces. Means of $H_i$, $I_j$ are usually
determined by integrals of the degrees of freedom density function $\rho$, mentioned in the
introduction. The integrals are taken over the i-th subdomain and the j-th interface,
respectively. Means may be randomized using a simple Gauss error distribution, or
some specific distribution derived from adoption characteristics.
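
The estimate described above, a mean obtained by integrating the d.o.f density over a subdomain and then randomized with a Gauss error, can be sketched as follows. Everything concrete here is an assumption made for illustration: the subdomain is an axis-aligned box, the integral uses the midpoint rule, the density function is invented, and the relative standard deviation of 5% is arbitrary.

```python
import random

def mean_dof(density, x0, x1, y0, y1, z0, z1, n=20):
    """Midpoint-rule integral of a d.o.f density over a box-shaped subdomain."""
    hx, hy, hz = (x1 - x0) / n, (y1 - y0) / n, (z1 - z0) / n
    total = 0.0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                total += density(x0 + (i + 0.5) * hx,
                                 y0 + (j + 0.5) * hy,
                                 z0 + (k + 0.5) * hz)
    return total * hx * hy * hz

def sample_dof(mean, rel_sigma, rng):
    """One randomized d.o.f count: Gauss error around the mean, clipped at zero."""
    return max(0.0, rng.gauss(mean, rel_sigma * mean))

# Hypothetical density that grows linearly along x, integrated over the unit cube.
rho = lambda x, y, z: 1000.0 * (1.0 + x)
mean = mean_dof(rho, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0)
draw = sample_dof(mean, 0.05, random.Random(0))
```

For this linear density the midpoint rule is exact up to rounding, so the mean is 1500 d.o.f for the whole cube.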

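Definition 4. A computational complexity graph $\hat{G} = (\hat{V}, \hat{E}, \hat{\Sigma}, \hat{\Gamma}, \hat{\Phi})$ corresponding
to a decomposition graph $G = (V, E, \Sigma, \Gamma, \Phi)$ is such an IE graph that:
$\hat{V} = \{v_i : \exists v \in V_i \wedge \Phi(v) \in BCS\}$ is the node set, $\hat{\Sigma} = \{H_i : \Omega \to \mathbb{N}, i = 1, \ldots, M\}$ is
a nonempty set of node labels, $\hat{\Gamma} = \{I_j : \Omega \to \mathbb{N}, j = 1, \ldots, m\}$ is a nonempty
set of edge labels, $\hat{E} = \{(v, I_j, w) : \exists (v, \lambda, w) \in E \text{ linking separate subgraphs}, \lambda \in \Gamma, j = 1, \ldots, m\}$ is
a nonempty set of graph edges, $\hat{\Phi} : \hat{V} \ni v_i \mapsto H_i \in \hat{\Sigma}$, $i = 1, \ldots, M$, is the node
labeling function, and j is an index that enumerates interfaces. The family of all
computational complexity graphs resulting from a decomposition of an engineering
solid S is denoted with $\widehat{DG}_S = \{\hat{G} : G \in DG_S\}$.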

Nodes of a computational complexity graph correspond to subdomains of
a particular decomposition and they reflect the complexity of computations connected
with solving a problem within them. Edges of such a graph relate to the predicted
interprocess communication resulting from assembling partial results obtained on
interfaces of subdomains.

The last of the three graphs introduced in this section is a computer network graph.
In our approach a physical heterogeneous parallel computer consists of computer
nodes $v_i$, $i = 1, \ldots, M$, and an asynchronous communication medium which connects all
nodes (e.g. Ethernet LAN technology). We assume that this computer is
nondecomposable, which means that it can perform only one operation of our CAE
application at a time. We assume that each task of our application is atomic, i.e. it should
be evaluated sequentially, on a single machine, without any communication with other
tasks. Moreover, we assume that there exists a pattern task for the considered class
of CAE applications. Let us introduce (see [8]) the stochastic vector
$T_\mu = \{T_\mu^i(n)\}$, $i = 1, \ldots, M$, that refers to the virtual computer. $T_\mu^i(n)$ is the
execution time of the pattern task on the i-th machine in the n-th time step, while
$\mu$ denotes the starting point index. $T_\mu^i$ may be evaluated as a discrete, periodic
Markov process. Its evolution is determined by the formula

$\Pi^i(n+1) = P_i \Pi^i(n)$, $i = 1, \ldots, M$, $n = \mu, \mu+1, \ldots$

where $\Pi^i(n)$ is the probability distribution for $T_\mu^i(n)$ and
$P_i$ is a $k \times k$ transition probability matrix identified for k distinct states of the i-th
machine loading, and for a finite set of time steps (due to process periodicity). Let us
define two new random vectors (their defining formulas are illegible in the source). The first
expresses the mean execution time of the pattern task during time horizon Z,
while the second, $X_{\mu,Z}$, is a random vector of power coefficients, components of which will serve
for labels of the computational complexity graph nodes. In [3] we have shown that
a computer network structure can be represented with the family of IE graphs, which
means that such a representation model preserves the good computational properties of its
processing schemes. Therefore, we define a computer network graph as an IE graph.
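
The evolution of the load distribution of a single machine, as a transition matrix applied repeatedly to a probability vector, can be sketched as below. The sketch makes several assumptions that are not in the paper: one time-independent matrix is used (the periodic process described above may use a different matrix per time step), the matrix acts on a column vector so its columns sum to one, and the two load states with their pattern-task times are invented.

```python
def step(P, pi):
    """One step of the evolution: returns P applied to the column vector pi."""
    k = len(pi)
    return [sum(P[r][c] * pi[c] for c in range(k)) for r in range(k)]

def forecast(P, pi0, steps):
    """Distribution over load states after the given number of time steps."""
    pi = list(pi0)
    for _ in range(steps):
        pi = step(P, pi)
    return pi

def expected_time(pi, state_times):
    """Mean execution time of the pattern task under distribution pi."""
    return sum(p * t for p, t in zip(pi, state_times))

# Hypothetical machine with two load states: idle (1.0 s) and busy (3.0 s).
P = [[0.9, 0.5],
     [0.1, 0.5]]                 # columns sum to one
pi1 = forecast(P, [1.0, 0.0], 1)
pi_long = forecast(P, [1.0, 0.0], 200)
t1 = expected_time(pi1, [1.0, 3.0])
```

Averaging such expected times over a time horizon gives one number per machine, which is the kind of quantity a performance forecast needs.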

Definition 5. A computer network graph is a complete IE graph
$H = (V_H, E_H, \Sigma_H, \Gamma_H, \Phi_H)$, where $V_H = \{v_i : i = 1, \ldots, M\}$ is the node set,
$\Sigma_H = \{X_{\mu,Z}^i : i = 1, \ldots, M\}$ is the set of node labels, $\Gamma_H$ is a nonempty set of edge
labels, $E_H = \{(v_i, \lambda, v_j) : \lambda \in \Gamma_H, i, j = 1, \ldots, M, i < j\}$ is a nonempty set
of graph edges, and $\Phi_H : V_H \ni v_i \mapsto X_{\mu,Z}^i \in \Sigma_H$ is the node labeling function.



4 Scaling of CAE Tasks as a Stochastic Control Problem 

Having introduced the necessary formalism we may formulate the scaling of a CAE
work as a stochastic control problem. To evaluate the adjustment of a particular
decomposition to the presumed distribution of computer performance over a network
we will utilize the functional

$F(\hat{G}, X_{\mu,Z}, \sigma) = E(\ldots)$ (the argument of the expectation is illegible in the source),

where $\sigma$ is a permutation of the M-element node set of a computational complexity
graph and E denotes the expected value operator.



For a given computational complexity graph $\hat{G} \in \widehat{DG}_S$ we consider

$\min_{\sigma \in S_M} F(\hat{G}, X_{\mu,Z}, \sigma)$,

where $S_M$ is the group of permutations of an M-element set, and $\widehat{DG}_S$ is the family of all
computational complexity graphs resulting from decomposition of an engineering solid
S. F can be understood as the distance function (metric) between $\hat{G}$ and H.

The optimal stochastic scaling problem consists in finding such a $G_{opt} \in DG_S$
for which $\hat{G}_{opt}$ satisfies the minimum property and a constraint that bounds the
communication contribution by ComRange (the exact formulas are illegible in the source).
The constraint involves random variables expressing the communicational
complexity and approximating the computational load of an entire work for G.
ComRange stands for the maximum admissible communication complexity
contribution.
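
The minimization over permutations can be illustrated by brute force on a toy instance. The sketch below does not implement the functional F. It uses a stand-in cost, the finishing time of the slowest machine when subdomain loads are divided by machine power coefficients, and deterministic numbers instead of random variables. All names and values are hypothetical, and the enumeration of every permutation is factorial in M, so it is only usable for very small examples.

```python
from itertools import permutations

def makespan(loads, powers, sigma):
    """Stand-in cost: subdomain sigma[i] runs on machine i; slowest machine decides."""
    return max(loads[sigma[i]] / powers[i] for i in range(len(powers)))

def best_assignment(loads, powers, cost):
    """Try every permutation of subdomains over machines and keep the cheapest."""
    best = None
    for sigma in permutations(range(len(loads))):
        c = cost(loads, powers, sigma)
        if best is None or c < best[0]:
            best = (c, sigma)
    return best

# Hypothetical mean d.o.f counts per subdomain and power coefficients per machine.
loads = [100.0, 400.0, 200.0]
powers = [1.0, 4.0, 2.0]
best_cost, best_sigma = best_assignment(loads, powers, makespan)
```

Here the heaviest subdomain lands on the most powerful machine, so all three machines finish at the same time.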



5 Syntactic Pattern Recognition for Efficient Decomposition 
and Allocation 

As one can easily notice, we have been able to define the optimal stochastic
scaling problem as a problem of finding a graph that satisfies a certain property in a
set of allowable graphs. Since we have been able to define all the representation
models (decomposition graphs, computational complexity graphs, and computer
network graphs) as special cases of the family of IE graphs, we can make use of
the very good computational properties of this family. It is well known that, in general,
graph processing schemes are based on graph matching, which makes them inefficient
(a non-polynomial time complexity), due to NP-completeness of the subgraph
isomorphism problem. Fortunately, IE graphs used in our model belong to the class of
ETPL(k) graph languages with a polynomial membership problem, which means that
processing schemes based on graph matching are computationally efficient.

Theorem (Theorem 5.1 in [3]). The algorithm of parsing ETPL(k) graph grammars
has time complexity $O(n^2)$, where n is the number of IE graph nodes.

It means that searching for any IE graph in a set of IE graphs (treated here as an
ETPL(k) language) is made with the ETPL(k) parser in $O(n^2)$ time (see [4]). However, in
order to apply such an efficient scheme, we have to discuss two key issues that
condition a successful use of such a syntactic pattern recognition-based processing
scheme. The first issue consists in defining an efficient processing scheme for
generating a decomposition graph from the initial one (cf. Figs. 1b and 2b). From the
syntactic pattern recognition point of view, this problem reduces to the
problem of translating one formal language into another according to a set of
predefined rules.

Fortunately, for ETPL(k) graph languages used for a solid representation, this
"translation" problem is solved by defining an efficient Syntax-Directed Translation
Scheme, SDTS (cf. [5], pp. 416-422). In [5], we have also shown that the IE-graph
based representation scheme is constructed in a way allowing us to derive
"translating" rules in an easy and intuitive way. For example, let us come back to our
examples shown in Figures 1 and 2. Let us notice that we have decomposed the solid
symmetrically with a plane parallel to its faces 3 and 4 (cf. Figs. 1a and 2a). Then,
instead of the graph shown in Fig. 1b we have the (decomposition) graph shown in Fig.
2b. The graph edge labelling exhibits this symmetry. Instead of the edge (1, 1.2, 2)
pointing to a square through pocket we have received two edges labelled with 1.3.2 and
1.4.2 pointing to two square slots being a result of the pocket decomposition. These
labels (1.3.2 and 1.4.2) can be "computed" knowing that we "decompose" the label
1.2 with the plane parallel to faces 3 and 4 of the original solid. A reader can easily
see other possibilities of such computations made for our example as well as
a possibility of defining "general" rules translating a graph before a succeeding
decomposition into a graph after it with such a representation scheme. These "general"
rules can then be formalized in the form of an efficient syntax-directed translation
scheme.
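
The single label computation worked through above, where the label 1.2 cut by a plane parallel to faces 3 and 4 yields 1.3.2 and 1.4.2, can be written as a one-line rule. The function below is a hypothetical generalization of that one example, not the translation scheme of [5]. It assumes a two-face "through" label and inserts each face parallel to the cutting plane between the start and end faces.

```python
def split_through_label(label, cut_faces):
    """Split a two-face 'through' label by a plane parallel to the given faces."""
    faces = label.split(".")
    if len(faces) != 2:
        raise ValueError("only labels of the form 'start.end' are handled")
    start, end = faces
    # one new label per half of the solid, each passing via one of the cut faces
    return [f"{start}.{c}.{end}" for c in cut_faces]

new_labels = split_through_label("1.2", (3, 4))
```

A full translation scheme would need one such rule per label shape and per relative position of the cutting plane.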

The second problem concerns an extension of a "pure" IE graph, for which an
efficient syntactic pattern recognition scheme has been defined (see [4]), into
a representation that allows one to find an optimal graph based on certain
measurements added to its structural elements, i.e. its nodes and edges. (In our case
these measurements are of a stochastic nature.) The key questions are as follows:

• Is it possible to add such measurements to IE graph structures that can be used
for comparing "distances" between graphs and in this way decide which one is "better"
according to some optimization criteria?

• If we add such information to IE graphs and we use a parsing processing scheme,
does the computational complexity increase?

Answers to both questions are satisfactory. In [2] "fuzzy" (in particular random)
IE graphs have been defined and a way of comparing them based on metric
measurements has been suggested. For processing such fuzzy IE graphs, an (efficient,
$O(n^2)$) error-correcting parser finding the "best matched" (according to predefined
criteria, in particular with respect to the distance defined by F) graph out of the set
(language) of IE graphs has been constructed.
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6 Concluding remarks 

• Some phases of the presented optimal partitioning strategy have been successfully
implemented. The rough solid partitioning algorithm using the IE graph representation
was described in [7]. Error-correcting parsers for fuzzy IE graphs were implemented
in [2]. The Markov performance forecast system was tested for a network of SUN
workstations (see [8]). Its advantage was confirmed for simple (homogeneous)
decomposition and for the task migration strategy (see [9]).

• The computational efficiency of such a meshless strategy, being its biggest
advantage, results from the use of a class of IE graphs as a formalism for
representing both decomposed solids and computer network structures. The IE
graphs belong to the family of ETPL(k) languages with a polynomial membership
problem (see [6]), which results in $O(n^2)$ time complexity of the underlying processing
schemes (see [4]). The rough partitioner mentioned in [4] is $O(n)$.

• The proposed strategy is dramatically less complex than mesh partitioners (see e.g.
[10]); moreover, it allows the computational network to be generated in parallel, which
additionally increases the overall speedup of a parallel CAE computation. It also
restricts interprocess communication to an acceptable fraction of the entire effort.

• The presented approach is adjusted for problem partitioning in the case of discrete
meshless solving methods (see e.g. [1]).
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Abstract. We discuss how to use Case-Based Reasoning (CBR) philoso- 
phy for solving various problems specified by complex objects represented 
by means of hierarchical information systems [3]. We show how to use 
this kind of knowledge base for the recognition of novel cases. Next we 
show how to identify new problems and how to use and adapt methods 
which were successful in past situations to the new ones. All issues are 
illustrated by examples, which are here some elementary mathematical 
tasks. 



1 Introduction 

The idea of Case-Based Reasoning (CBR) systems is to solve new problems by
adapting previously successful solutions for similar problems. The process is
represented by four steps, Retrieve, Reuse, Revise, Retain, which form a schematic
cycle [1]. The most difficult problem is related to the discovery of cases similar to
a given one, so that on the basis of algorithms corresponding to the extracted
similar cases an algorithm for the given case can be constructed. Hence the
problem arises of how to construct a knowledge base relevant for this task. Here we
have two pragmatic measures: the functionality and the ease of acquisition of
the information represented in the case [2].
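
The four-step cycle can be sketched in miniature. The code below is a deliberately simplified illustration, not the method of this paper: cases are dictionaries with invented field names, similarity is the count of matching feature values, and the Revise step is reduced to an acceptance check supplied by the caller.

```python
def retrieve(case_base, problem):
    """Retrieve: pick the stored case sharing the most feature values."""
    def similarity(case):
        feats = case["features"]
        return sum(1 for k, v in problem.items() if feats.get(k) == v)
    return max(case_base, key=similarity)

def solve(case_base, problem, adapt, accept):
    nearest = retrieve(case_base, problem)               # Retrieve
    candidate = adapt(nearest["solution"], problem)      # Reuse (with adaptation)
    if not accept(problem, candidate):                   # Revise (simplified check)
        return None
    case_base.append({"features": dict(problem),         # Retain
                      "solution": candidate})
    return candidate

# Hypothetical case base of elementary geometry tasks.
case_base = [
    {"features": {"figure": "circle", "given": "radius"},
     "solution": "area = pi*R**2"},
    {"features": {"figure": "square", "given": "side"},
     "solution": "area = S**2"},
]
problem = {"figure": "circle", "given": "diameter"}
result = solve(
    case_base, problem,
    adapt=lambda sol, prob: sol.replace("R", "(D/2)"),
    accept=lambda prob, cand: "D" in cand,
)
```

The adapted solution is stored as a new case, so a later problem with a given diameter would be retrieved directly.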

In this paper cases are represented by means of hierarchical complex objects
together with algorithms (strategies) for transforming them. Any complex object
is defined by hierarchically structured attributes, on the basis of which (and using
expert knowledge) some additional characteristics of the object are created.
Any case (object) is assigned a strategy (algorithm) or a family of algorithms
working on it. The algorithm can start computation if values of its input variables
(parameters) are specified properly by the object representation. Algorithms
form a hierarchical structure, starting from very precise ones referring to
objects defined exactly to general ones for objects defined more generally
or in a vague way. To successfully extract an algorithm for new objects
(cases), they must be properly decomposed using attributes. As a result,
even if the algorithm for a similar object cannot be reused exactly, it can be
adapted. Adaptation of an algorithm is a transformation which replaces some of
its segments by ones proper for the case, taking the characteristics of such an object
into account. The idea of this paper is to show a method for the construction of such a
hierarchical complex object knowledge base by proper case decomposition, and for
adapting algorithms known for some cases similar to a newly identified
case, to obtain the algorithm corresponding to that case.

2 Information systems for complex objects 

An information system is a pair $\mathbb{A} = (U, A)$ where U is a set called the universe
of objects (cases) and A is a set of attributes; any attribute $a \in A$ is a mapping
on the universe U. With every attribute $a \in A$ we associate a set of its values
$V_a$, the domain of a. We let $V = \bigcup \{V_a : a \in A\}$ [4].

The task (problem) can be characterized as an object. The information vector of
any object $O \in U$ is defined by $Inf_A(O) = \{(a, a(O)) : a \in A\}$. For example,
when we consider the problems of elementary mathematics, the attributes can
be types of geometrical figures, edge lengths, relations between figures,
equation or inequality variables etc. The attribute values are the specific values of
the attributes taken from the task contents. For example, for the attribute 'type
of a geometrical figure' its value can be 'circle', for the attribute 'radius' its
value can be '2 cm', while for the attribute 'area' its value can be 'unknown'.
One can consider a new information system $\mathbb{A}' = (U, A')$ where some attributes
corresponding to attributes from A can have values Yes, No or Alg. If for example
a(O) = Yes, b(O) = Yes and c(O) = Alg then it means that the dependency $a, b \to c$
is true in $\mathbb{A}$ and Alg is a pointer to the algorithm which returns the value of c if
the values of a, b are given.
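
The information vector defined above can be computed directly once attributes are modelled as functions on objects. The sketch below is an illustration under that assumption; the dictionary-based objects, the attribute names and the helper names are invented for the example.

```python
def inf_vector(attributes, obj):
    """Information vector of obj: the set of (attribute name, value) pairs."""
    return {(name, a(obj)) for name, a in attributes.items()}

def value_set(attributes, universe):
    """Union of all attribute domains observed over the universe."""
    return {a(o) for a in attributes.values() for o in universe}

# Hypothetical universe with a single task about a circle.
task = {"type": "circle", "radius": "2 cm", "area": "unknown"}
universe = [task]
attributes = {
    "type": lambda o: o["type"],
    "radius": lambda o: o["radius"],
    "area": lambda o: o["area"],
}
info = inf_vector(attributes, task)
values = value_set(attributes, universe)
```

Two tasks with the same information vector are indistinguishable with respect to the chosen attributes, which is the basis for grouping similar cases.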

Example 1. Let us consider simple geometrical problems concerning the
circle. They can be represented by a table (Table 1), the columns of which can
be seen as attributes, the rows as objects, and the table entries as attribute values.
Elementary objects are the elementary problems (like computing the circumference
given the radius length); attributes are radius, diameter, circumference, area;
attribute values are Yes (the attribute has a value) or Alg (one can obtain the
value of the attribute by the algorithm pointed to).

Table 1. (Circle)

object   a_1 (radius)   a_2 (diameter)   a_3 (circumference)   a_4 (area)
O_1      Yes            Alg              Alg                   Alg
O_2      Alg            Yes              Alg                   Alg
O_3      Alg            Alg              Yes                   Alg
O_4      Alg            Alg              Alg                   Yes


For any object from Table 1 one can obtain the values of all its attributes
using known dependencies. An attribute value can be equal to Alg only when other
attributes have values that allow the unknown value of the attribute to be
computed by the algorithm pointed to by Alg. For objects $O_1, \ldots, O_4$ from
Table 1 the same algorithm, which solves the equations
$Area = \pi (Radius)^2$; $Circumference = 2 \pi (Radius)$;
$Diameter = 2 (Radius)$, can be used. We have three equations, so we can solve
them for up to three unknowns. If the value of an attribute is Yes, we may
substitute into the equations the specific attribute value taken from the case
description.
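
As a rough sketch of what the algorithm pointed to by Alg might look like for
Table 1, the Python function below completes all four circle attributes from
the single attribute whose value is Yes. The function name `solve_circle` and
the dictionary layout are assumptions for illustration only.

```python
import math


def solve_circle(known):
    """Complete the circle attributes from the one attribute marked Yes.

    `known` maps exactly one attribute name to its numeric value.
    """
    (name, value), = known.items()
    # Reduce every case to the radius, then apply the three equations.
    if name == "radius":
        r = value
    elif name == "diameter":
        r = value / 2
    elif name == "circumference":
        r = value / (2 * math.pi)
    elif name == "area":
        r = math.sqrt(value / math.pi)
    else:
        raise ValueError(f"unknown attribute: {name}")
    return {
        "radius": r,
        "diameter": 2 * r,
        "circumference": 2 * math.pi * r,
        "area": math.pi * r ** 2,
    }


result = solve_circle({"diameter": 4.0})
```

The same function serves all four objects of Table 1, which mirrors the remark
that one algorithm is shared by $O_1, \ldots, O_4$.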

This is an example of objects which we call simple or elementary objects.
Any simple object is represented by elementary attributes. Elementary
attributes are the primary concepts of each field, for example the primary
concepts of elementary mathematics (geometry). Obviously, among elementary
mathematical problems we can find other such simple objects, for example those
concerned with other geometric figures (triangles, tetragons etc.). Such objects
are represented in the same way as in the above example. Simple objects
described by the same set of attributes are represented by means of an
information system. Example 1 shows the information system
$\mathcal{A}_C = (U_C, A_C)$ for the problems dealing with the circle. The
index $C$ represents the type of objects in the system. Tables which represent
information systems are also labeled by the object type of the system. For
example, the objects in Table 1 have the type Circle.

Taking two (or more) simple objects into account, one can construct a new
object called a complex object. Such a complex object is described by the
attributes representing information about the simple objects used in its
construction, attributes describing relations between these simple objects, and
some extra attributes specifying the so-called characteristics of such an
object, defined by the expert.

Let us consider two information systems for objects of some types,
$\mathcal{A}_1 = (U_1, A_1)$ and $\mathcal{A}_2 = (U_2, A_2)$, where
$A_1 \cap A_2 = \emptyset$. The indices 1 and 2 represent the types of the
objects. We may construct a new information system $\mathcal{A}_{(1,2)}$
by composition of objects from the information systems $\mathcal{A}_1$ and
$\mathcal{A}_2$ in the following way:
$\mathcal{A}_{(1,2)} = (U_{(1,2)}, A_{(1,2)}, f)$, where $f$ represents a
decomposition function $f : U_{(1,2)} \to U_1 \times U_2$ (with projection
functions $f_1$, $f_2$ for $f$ defined by $f_1(O) = O^{(1)}$,
$f_2(O) = O^{(2)}$, where $f(O) = (O^{(1)}, O^{(2)})$); the complex object $O$
is constructed out of the objects $O^{(1)}$, $O^{(2)}$.

Information system $\mathcal{A}_{(1,2)}$ points to the information systems
$\mathcal{A}_1$, $\mathcal{A}_2$ corresponding to the types of components (of
type 1, 2) of objects in $\mathcal{A}_{(1,2)}$. Links from a subset of the
universe $U_{(1,2)}$ of the information system $\mathcal{A}_{(1,2)}$ to the
subsets of the universes $U_1$, $U_2$ of the information systems
$\mathcal{A}_1$, $\mathcal{A}_2$ are specified by attribute value vectors. Sets
of attributes $A' = \{a_1, \ldots, a_n\} \subseteq A_1$,
$B' = \{b_1, \ldots, b_m\} \subseteq A_2$ specify the subsets of the universes
$U_1$, $U_2$ proper for the decomposition of the complex object, while the set
of attributes $C$ specifies relations between the objects from which the complex
object has been constructed. An information system for complex objects can also
be represented by a table (see Table 2). Any value vector of attributes
$a_1, \ldots, a_n$ specifies a subset of the universe $U_1$, defined by all
objects from $\mathcal{A}_1$ consistent with this value vector; any value vector
of attributes $b_1, \ldots, b_m$ specifies a subset of the universe $U_2$,
defined by all objects from $\mathcal{A}_2$ consistent with this value vector;
and any value vector of attributes $c_1, \ldots, c_k$ specifies the relations
between objects from the universes $U_1$ and $U_2$. The set $C$ of attributes is
a set of relational attributes $c$ of the form
$c : U_{(1,2)} \to V_c = \{0, 1\}$.

For example, such an attribute can be defined by a relation
$r \subseteq U_1 \times U_2$ in the following way:
$c(O) = 1 \Leftrightarrow (f_1(O), f_2(O)) \in r$. Any value vector of
attributes $d_1, \ldots, d_l$ is defined by the expert and specifies the
characteristics of the complex object, taken from the values of relational
attributes, types and special attributes of the objects from which the complex
object has been constructed. The value of such an attribute can be a
description of the characteristic, a concrete attribute value, or a pointer to
the algorithm which can compute the concrete value of the attribute.
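
The decomposition function $f$ and a relational attribute $c$ can be sketched
in Python as follows. The class `ComplexObject`, the component names and the
relation `inscribed` are hypothetical and serve only to make the definition
$c(O) = 1 \Leftrightarrow (f_1(O), f_2(O)) \in r$ concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComplexObject:
    first: str   # name of the component taken from U_1
    second: str  # name of the component taken from U_2


def f(obj):
    """Decomposition function f : U_(1,2) -> U_1 x U_2."""
    return (obj.first, obj.second)


def relational_attribute(r):
    """Build an attribute c with c(O) = 1 iff (f_1(O), f_2(O)) is in r."""
    def c(obj):
        return 1 if f(obj) in r else 0
    return c


# Hypothetical relation r: which triangle is inscribed in which rectangle.
inscribed = {("triangle_1", "rectangle_1")}
c1 = relational_attribute(inscribed)
```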



Table 2. (1&2)

object   a_1  ...  a_n    b_1  ...  b_m    c_1  ...  c_k    d_1  ...  d_l
O_1      v_1  ...  v_n    w_1  ...  w_m    u_1  ...  u_k    s_1  ...  s_l
O_2      v'_1 ...  v'_n   w'_1 ...  w'_m   u'_1 ...  u'_k   s'_1 ...  s'_l
...



Information system $\mathcal{A}_{(1,2)}$ can be generalized to a new
information system $\mathcal{A}_{(G_1,G_2)}$ by generalizing the types and
attributes of the objects represented by the system, and by generalizing the
relations between these objects.

For example, an information system describing objects constructed from an object
of the type Triangle and an object of the type Rectangle can be generalized to
an information system describing objects constructed from two objects of the
type Figure. We may note that after such a generalization, the information
system $\mathcal{A}_{(1,2)}$ can be linked to new information systems for
objects of the generalized type.

The expert can compose various defined objects (simple and complex). In each
case the method is the same as described for the composition of two simple
objects. A complex object constructed by a composition of some other complex
objects can be decomposed in various ways if we do not have any information
about the possible methods of such decomposition. If the decomposition of the
complex object is not appropriate, the algorithm assigned to it will fail.
That is why the formulation of relational attributes and their values influences
a proper decomposition. For example, for a complex object $O$ constructed from
two other complex objects $O_1$, $O_2$, information on the relations between
the objects $O_1$, $O_2$ alone is not sufficient. To make a proper decomposition
of the object $O$, we should also have additional information about the
components of the objects $O_1$, $O_2$ and their interrelations.



Figure 1 . 




Example 2. What is the area of the figure created by inscribing a figure A
into a figure B? Figures A and B are any geometrical figures (such as circles,
triangles, tetragons, pentagons) represented by different values of attributes.
The ways of inscribing a figure A into a figure B can differ as well. Some
possible examples are given in Figure 1.

To simplify the task we will focus only on cases 2-6 (Fig. 1). To solve such
kinds of problems we need to have in the knowledge base a description of simple
objects, of more complex objects and finally of the main complex object (the
general contents of the problem). An example of a simplified collection of such
simple objects and relations among them (with a reduced number of possible
attributes) is presented in Tables 3, 4 and 5, labeled by the object types
Triangle, Rectangle and Triangle&Rectangle.



Table 3. (Triangle)

object  a'_1 (subtype  a'_2      a'_3      a'_4      a'_5          a'_6          a'_7          a'_8
        of object)     (edge a)  (edge b)  (edge c)  (height h_a)  (height h_b)  (height h_c)  (area)
O'_1    equilateral    Yes       Alg'      Alg'      Alg'          Alg'          Alg'          Alg'
O'_2    ordinary       Yes       No        No        Yes           No            No            Alg'
O'_3    ordinary       No        Yes       No        No            Alg'          No            Yes
O'_4    ordinary       Yes       Yes       No        No            Yes           No            Alg'



Table 4. (Rectangle)

object  a''_1 (subtype of object)  a''_2 (edge a)  a''_3 (edge b)  a''_4 (diagonal d)  a''_5 (area)
O''_1   square                     Yes             Alg''           Alg''               Alg''
O''_2   ordinary                   Yes             Yes             Alg''               Alg''
O''_3   ordinary                   Yes             Alg''           Yes                 Alg''
O''_4   ordinary                   Alg''           Yes             Yes                 Alg''



Table 5. (Triangle&Rectangle)

object  a_1   a_2     a_3     b_1   b_2     c_1     c_2       c_3       c_4       d_1     d_2
        edge  height  common  edge  height          (vertex   (vertex   (vertex
                      edge                          A)        B)        C)
O_1     No    Yes     Yes     No    No      inscr   Yes       Yes       Yes       char_1  Alg_1
O_2     Yes   Yes     Yes     No    No      inscr   Yes       No        Yes       char_2  Alg_2
O_3     No    Yes     No      No    No      inscr   Yes       No        No        char_3  Alg_3
O_4     Yes   No      Yes     No    No      inscr   No        No        No        char_4  Alg_4
O_5     No    Yes     Yes     No    Yes     circum  Yes       No        No        char_5  Alg_5



Attribute $c_1$ describes the type of relation; attributes $c_2 - c_4$ specify
whether the vertices of the figures extracted from a given object coincide. Any
value vector of attributes $a_i$ points to a subset of the objects of Table 3
(for example, the subset pointed to by object $O_2$ is $\{O'_1, O'_4\}$); any
value vector of attributes $b_1$, $b_2$ points to a subset of Table 4;
attributes $c_1 - c_4$ are relational ones (they specify, for example, possible
methods of inscribing a triangle into a rectangle); and finally attributes
$d_1$, $d_2$ specify a characteristic of a complex object. For example,
attribute $d_1$ for object $O_2$ has a value ($char_2$) which, among others,
states that there is a common edge (between coinciding vertices of components),
that one rectangle edge is a height of the triangle to the common edge, and some
consequences of this fact. Attribute $d_2$ specifies the area of the figure
created by inscribing a triangle into a rectangle. The value of attribute
$d_2 = Alg_i$ is a pointer to the algorithm which returns a concrete value of
the area of the created figure. The algorithms $Alg_i$ refer to complex
objects. These algorithms are described by sequences of operations
$Alg_i = (op_1, \ldots, op_n)$. Some of these operations refer to subproblems
dealing with concrete component objects of a complex object. To solve these
subproblems, the complex object must be properly decomposed into its component
objects, characterized by special attributes and their value vectors. That is
why such an algorithm is characterized by the type of component objects, the
value vector of attributes specifying sets of component objects, the relations
between them, and the known characteristics of the complex object. We call an
algorithm of this type a detailed algorithm. The value vector of attributes
specifying sets of proper component objects points to a subset of the universe
of the information systems corresponding to the component objects. Table 5 can
be generalized by generalizing the object type Triangle&Rectangle and the
attributes $a_1 - a_3$, $b_1$, $b_2$ to a general object type Figure&Figure,
while the attributes $c_1 - c_4$; $d_1$, $d_2$ are generalized to attributes
$c$, $d$. Table 6 presents the description of the main complex object of a
general type defined by attributes $c$, $d$.



Table 6. (Figure&Figure)

object   c (relation)    d (area of the figure created by
                         inscribing two figures)
O_1      inscribe        ALG_1
O_2      circumscribe    ALG_2



This general table can be linked not only to Tables 3 and 4, as Table 5 is, but
to any other tables for objects whose type can be generalized to the type
Figure.

The value $ALG_i$ of attribute $d$ is a pointer to the algorithm which returns a
concrete value of the area of the figure created by inscribing two other
figures. The algorithms corresponding to $ALG_i$ refer to complex objects of a
general type. These algorithms are in fact collections of equivalent algorithms
corresponding to different decompositions of the complex object. Any algorithm
chosen from these collections can require different input information about the
components of the complex object. We call algorithms of this type general
algorithms. If the decomposition of the complex object is not appropriate, the
detailed algorithm assigned to this decomposition fails. In such a situation the
general algorithm lets us assign another corresponding algorithm, appropriate to
another possible decomposition of the complex object. In Example 2, object
$O_1$ of the type Figure&Figure (Table 6) can be decomposed into two objects or
into several objects of type Figure. For example, case 3 (Fig. 1) can be
decomposed into one triangle and one rectangle, or into three triangles and one
tetragon, etc. If the first decomposition does not allow the area to be
computed, one may try an algorithm for the second decomposition.
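
The fallback behaviour of a general algorithm can be sketched as a loop over
(decomposition, detailed algorithm) pairs. Everything named below
(`DecompositionError`, the two decompositions, the area formulas and the object
layout) is a hypothetical illustration, not the system described in the paper.

```python
class DecompositionError(Exception):
    """Raised when a decomposition does not fit the given object."""


def general_algorithm(obj, detailed_algorithms):
    """Try each (decompose, solve) pair until one decomposition succeeds."""
    for decompose, solve in detailed_algorithms:
        try:
            return solve(decompose(obj))
        except DecompositionError:
            continue  # this decomposition is not appropriate, try the next
    raise DecompositionError("no known decomposition fits the object")


def as_triangle_and_rectangle(obj):
    if "triangle" not in obj or "rectangle" not in obj:
        raise DecompositionError("components not available")
    return obj["triangle"], obj["rectangle"]


def as_known_parts(obj):
    if "part_areas" not in obj:
        raise DecompositionError("part areas not available")
    return obj["part_areas"]


def area_between(parts):
    # Hypothetical detailed algorithm: rectangle area minus triangle area.
    triangle, rectangle = parts
    return rectangle["a"] * rectangle["b"] - 0.5 * triangle["edge"] * triangle["height"]


def area_from_parts(parts):
    # Hypothetical detailed algorithm: sum of the areas of the parts.
    return sum(parts)


ALGORITHMS = [
    (as_triangle_and_rectangle, area_between),
    (as_known_parts, area_from_parts),
]
```

Calling `general_algorithm({"part_areas": [1.0, 2.0, 3.0]}, ALGORITHMS)` skips
the first decomposition, which does not fit, and uses the second one.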

3 Identification of the new object with the similar ones 
coded in the knowledge base 

The problem of object identification in the CBR cycle is called Retrieve. The
Retrieve process must be preceded by a proper representation of the new object.
During the implementation process the new object is matched against the
constructed knowledge base. To do this it should be decomposed up to some level.
Next one may find descriptions of similar known objects and adapt their
algorithms to our case. Decomposition into more general modules, for example
into the general object type and the relations of its components, gives little
information about the object, and it can be difficult to assign a correct
detailed algorithm due to the lack of some attribute values. Decomposition into
detailed attributes may cause difficulties with the adaptation of general
strategies. Retrieval starts from the identification of the proper information
system by comparing the new object type with the types of objects in the
information systems. Next we specify a subset of the universe $U_i$ of the
chosen information system $\mathcal{A}_i = (U_i, A_i)$ by comparing the
attribute value vector of the new object with the objects from the universe
$U_i$. We select a similar object, or a subset of similar objects, with a
maximal attribute value vector consistent with the value vector of the new
object. A problem arises here of selecting the proper algorithm for the case
when a subset of similar objects with different algorithms has been chosen. For
example, for Table 5 we may obtain the subset of objects
$\{O_1, O_3, O_5\}$. In such a situation the object most promising for success
must be selected. This reduces nondeterministic choices in object
identification. The reduction can be done by using weights of attributes
defined by experts.
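
A minimal sketch of such a retrieval step is given below. It scores each stored
object by the (optionally weighted) number of attributes on which it agrees
with the new object and returns the objects with the maximal score. The scoring
rule, the function `best_matches` and the weight values are assumptions; the
attribute values are the $a_1 - a_3$, $b_1$, $b_2$, $c_1$ columns of Table 5.

```python
# Selected columns of Table 5 (Triangle&Rectangle).
TABLE_5 = {
    "O1": {"a1": "No", "a2": "Yes", "a3": "Yes", "b1": "No", "b2": "No", "c1": "inscr"},
    "O2": {"a1": "Yes", "a2": "Yes", "a3": "Yes", "b1": "No", "b2": "No", "c1": "inscr"},
    "O3": {"a1": "No", "a2": "Yes", "a3": "No", "b1": "No", "b2": "No", "c1": "inscr"},
    "O4": {"a1": "Yes", "a2": "No", "a3": "Yes", "b1": "No", "b2": "No", "c1": "inscr"},
    "O5": {"a1": "No", "a2": "Yes", "a3": "Yes", "b1": "No", "b2": "Yes", "c1": "circum"},
}


def best_matches(new_obj, table, weights=None):
    """Return the names of the stored objects with maximal weighted agreement."""
    weights = weights or {}
    scores = {
        name: sum(weights.get(a, 1.0) for a, v in new_obj.items() if case.get(a) == v)
        for name, case in table.items()
    }
    top = max(scores.values())
    return sorted(name for name, score in scores.items() if score == top)


# A partially described new object: three candidates remain.
candidates = best_matches({"a1": "No", "a2": "Yes", "b1": "No"}, TABLE_5)

# Hypothetical expert weight on b2 resolves a tie between O3 and O5.
query = {"a1": "No", "a2": "Yes", "a3": "No", "b2": "Yes"}
tie = best_matches(query, TABLE_5)
resolved = best_matches(query, TABLE_5, weights={"b2": 2.0})
```

With unit weights the second query leaves two equally good candidates; raising
the weight of one attribute is one simple way to make the choice deterministic.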

4 The reuse strategies of similar objects to the new 
object 

This problem in the CBR cycle is called Reuse. Several situations of reusing
and adapting algorithms may occur here: (i) the object (case) from the
information system and the new one match exactly; (ii) the new object
(case) is similar to a case from the knowledge base, i.e. they have the same
values on sufficiently many attributes; (iii) the new object (case) is similar
to several cases from the same or different information systems. In the first
situation one can reuse the algorithm without any adaptation. In the second
situation the algorithm has to be adapted, taking the characteristics of the
object into account. The method of algorithm adaptation by analysis of object
characteristics is a transformation which lets us substitute some operations of
the algorithm with different ones, or rearrange these operations into ones more
proper for the case, by taking these characteristics into account. The
rearranging process replaces or modifies the chosen operation by taking into
account some additional relations between object components specified by the
characteristics. For example, we may compute the area of the tetragon first
instead of the area of the triangle, if these two figures are well related. In
Example 2 we may easily adapt, by the use of this method, the algorithm for case
5 (Fig. 1) to case 4 (Fig. 1). In the third situation the algorithm has to be
adapted as well, but the methods of adaptation are much more complicated here,
due to the fact that the characteristics of many objects must be taken into
account.

The algorithm adaptation by analysis of object characteristics returns a
solution when the given new object has been properly decomposed. In the case of
an improper decomposition this adaptation method fails, and we have to refer
to the general algorithm. From the collection of equivalent detailed algorithms
we may choose a new algorithm corresponding to a different decomposition of the
complex object. If the decomposition is proper, the algorithm can be adapted to
the new object. The object is specified by the given attributes and their value
vector. The components given by the formulated problem are not in all cases the
ones to be matched in the knowledge base. Sometimes the proper decomposition
elements can be hidden in the contents of the task. In Example 2 we may note
that for case 3 (Fig. 1), characterized by some attributes, case 7 (Fig. 1) can
be more similar than case 6 (Fig. 1). During the Reuse process, the operations
of a new algorithm are temporarily stored, and the operations which turned out
to be wrong are not stored. The final removal of any remaining wrong operations
is done while re-executing the operations of the new algorithm, a process
called Revise in the CBR cycle. In the Revise process we check whether the
result of each operation is needed to execute some of the next operations;
otherwise the operation will not be stored. The new algorithm is stored by the
Retain process.
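
The check performed in the Revise step can be sketched as a backward pass over
the stored operations that keeps only those whose result is needed later. The
triple format `(name, inputs, output)` and the operation names are assumptions
introduced for illustration.

```python
def revise(operations, goal):
    """Keep only the operations whose result is needed to obtain `goal`.

    Each operation is a triple (name, inputs, output).
    """
    needed = {goal}
    kept = []
    for name, inputs, output in reversed(operations):
        if output in needed:
            kept.append((name, inputs, output))
            needed.discard(output)   # this operation supplies the value
            needed.update(inputs)    # its inputs are now required
    kept.reverse()
    return kept


# Hypothetical trace of a new algorithm; the diagonal is never used.
trace = [
    ("triangle_area", ("edge", "height"), "t_area"),
    ("rectangle_diagonal", ("a", "b"), "diagonal"),
    ("rectangle_area", ("a", "b"), "r_area"),
    ("subtract", ("r_area", "t_area"), "area"),
]

revised = revise(trace, "area")
```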

Conclusions. In the paper, the construction of a hierarchical knowledge base
which seems to be well suited for CBR systems was shown. In the implementation
we use the object-oriented methodology. The main problem is a proper
decomposition and definition of the object which reflects the case. Another
problem is the ease of acquisition of the algorithm, which could be achieved by
the right structure of algorithm construction and the right object
decomposition. The mentioned methods of algorithm adaptation were described just
to show some abilities of such an approach to problems dealing with complex
objects, and are not the only ones. The system is being constructed according to
these ideas and contains many more methods of algorithm adaptation, object
decomposition and recognition. Simple mathematical examples were chosen to make
the methods easier to understand; however, the system constructed in the way
presented can be adapted to other fields.

Acknowledgments. The author would like to thank Professor Andrzej Skowron
for formulating the subject and for his numerous discussions and helpful
critical remarks throughout the investigation.

Abstract.

General functional decomposition has important application in 
many fields of modern engineering and science. Its practical 
usefulness for very complex systems is however limited by lack 
of an effective and efficient method for selection of the appropri- 
ate input supports for sub-systems. In this paper, an effective 
and efficient heuristic method for input support selection is 
proposed and discussed. The experimental results demonstrate 
that the method is able to construct optimal or near optimal 
supports efficiently even for large systems. 



1 Introduction 

Decomposition is a central activity in analysis and design of complex systems. It 
is fundamental to many fields of modern engineering and science. Strong stimuli 
for developing decomposition techniques come from such areas as pattern anal- 
ysis, knowledge discovery, machine learning, data mining and decision making, 
but also from logic synthesis in computer-aided design of very large integrated
circuits (VLSI-CAD) [2].

Functional decomposition consists of breaking down a complex system into 
a network of smaller and relatively independent co-operating sub-systems, in 
such a way that the original system's behavior is preserved. It can be used in
all fields mentioned above. The motivation for using it in system analysis and 
design is to reduce the problem complexity and to find well structured network 
of coherent sub-systems. A complex system is decomposed into a network of 
smaller subsystems, such that each of them is easier to analyze, understand or 
synthesize. Although the multi-level functional decomposition gives very good 
results [7] [8], its practical usefulness for very complex systems is limited by
lack of an efficient method for the construction of sub-systems. The
decomposition's quality heavily depends on the effectiveness and efficiency of
the input support
selection for subsystems. However, the commonly used systematic method of 
input support selection, based on checking of all possible supports, is inefficient 
for larger problem instances. 

In this paper, an efficient heuristic method for input support selection is 
proposed. It is based on the analysis of information relationships in a considered 
system. Application of information relationship measures for construction of the
input supports allows us to reduce the search space to a manageable size while 
keeping the high-quality solutions in the reduced space. 

After introducing some basic theory, the proposed input support selection 
method is presented. Subsequently, some experimental results obtained with a
prototype tool that implements the method are discussed.

The experimental results demonstrate that the method is able to construct 
optimal or near optimal supports efficiently, even for large systems. It is much 
faster than the systematic method while delivering results of comparable quality. 



2 Functional decomposition 



"Partitions" with non-disjoint blocks are referred to as rough partitions [5] [6]
(r-partitions) or set systems [1]. We recall some information on partition-based
modeling that is necessary for understanding the paper. For example, the
function $F$ of Table 1 can be represented by the following set of r-partitions:



$P(x_1) = \{1,2,4,5,8,9;\ 3,6,7;\ 10\}$,
$P(x_2) = \{1,2,3,4,7,8;\ 4,5,6,7,9;\ 4,7,10\}$,
$P(x_3) = \{1,5,9;\ 2,4,8;\ 3,6,7,10\}$,
$P(x_4) = \{1,3,4;\ 2,8;\ 5,7;\ 6,9,10\}$,
$P_F = \{9,10;\ 3,4,6;\ 5;\ 5,8;\ 1,7;\ 2,8\}$.





       x_1  x_2  x_3  x_4  y_1  y_2  y_3
 1     0    0    0    0    1    1    0
 2     0    0    1    1    1    1    1
 3     1    0    2    0    0    1    1
 4     0    -    1    0    0    1    1
 5     0    1    0    2    1    0    -
 6     1    1    2    3    0    1    1
 7     1    -    2    2    1    1    0
 8     0    0    1    1    1    -    1
 9     0    1    0    3    0    1    0
10     2    2    2    3    0    1    0

Table 1. Function table of the multiple-valued, multiple-output, incompletely
specified discrete function F.



The product of r-partitions $P = P(x_2) \cdot P(x_3)$ (computed by finding
products of the r-partitions' blocks) is as follows:
$P = \{1;\ 2,4,8;\ 3,7;\ 5,9;\ 4;\ 6,7;\ 7,10\}$.

In this way, various information streams in discrete information systems can
be modeled using r-partitions.
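
The product above can be reproduced with a few lines of Python. This is only a
sketch under the assumption that an r-partition is stored as a list of
`frozenset` blocks; the function name `product` is ours.

```python
def product(p, q):
    """Product of two r-partitions: all distinct non-empty block intersections."""
    blocks = []
    for a in p:
        for b in q:
            c = a & b
            if c and c not in blocks:
                blocks.append(c)
    return blocks


# r-partitions induced by x_2 and x_3 of Table 1.
P_x2 = [frozenset({1, 2, 3, 4, 7, 8}), frozenset({4, 5, 6, 7, 9}), frozenset({4, 7, 10})]
P_x3 = [frozenset({1, 5, 9}), frozenset({2, 4, 8}), frozenset({3, 6, 7, 10})]

P = product(P_x2, P_x3)
```

Note that a block such as {4} is kept even though it is contained in {2, 4, 8},
which agrees with the product listed in the text.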

Let $A$ and $B$ be two subsets of $X$ such that $A \cup B = X$. Assume that the
variables $x_1, \ldots, x_n$ have been relabeled in such a way that
$A = \{x_1, \ldots, x_r\}$ and $B = \{x_{n-s+1}, \ldots, x_n\}$. For an
$n$-tuple $x$, the first $r$ components are denoted by $x^A$, and the last $s$
components by $x^B$.
Let $F$ be a Boolean function with $n > 0$ inputs and $m > 0$ outputs, and let
$(A, B)$ be as above. Assume that $F$ is specified by a set $\mathcal{F}$ of the
function's cubes. Let $G$ be a function with $s$ inputs and $p$ outputs, and let
$H$ be a function with $r + p$ inputs and $m$ outputs. The pair $(G, H)$
represents a serial decomposition of $F$ with respect to $(A, B)$ if, for every
minterm $b$ relevant to $F$, $G(b^B)$ is defined, $G(b^B) \in \{0, 1\}^p$, and
$F(b) = H(b^A, G(b^B))$. $G$ and $H$ are called blocks of the decomposition.



Fig. 1. Schematic representation of serial decomposition ($X = A \cup B$).



Theorem 1. If there exists an r-partition $\Pi_G$ on $\mathcal{F}$ such that
$P(B) \le \Pi_G$, and $P(A) \cdot \Pi_G \le P_F$, then $F$ has a serial
decomposition with respect to $(A, B)$ [5].



Example 1. Let us consider a functional decomposition with $A = \{x_3, x_4\}$
and $B = \{x_1, x_2\}$ for the function of Table 1, specified by the set
$\mathcal{F}$ of its cubes numbered 1 through 10. For these $A$ and $B$:
$P(A) = \{1;\ 5;\ 9;\ 4;\ 2,8;\ 3;\ 7;\ 6,10\}$,
$P(B) = \{1,2,4,8;\ 4,5,9;\ 4;\ 3,7;\ 6,7;\ 7;\ 10\}$, and
$\Pi_G = \{1,2,4,5,6,7,8,9;\ 3,7,10\}$ satisfies the conditions of Theorem 1.
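
The two conditions of Theorem 1 can be checked mechanically for this example.
In the sketch below an r-partition is a list of `frozenset` blocks, and
$P \le Q$ is read as "every block of $P$ is contained in some block of $Q$";
the helper names `product` and `leq` are ours.

```python
def product(p, q):
    """Product of two r-partitions: all distinct non-empty block intersections."""
    blocks = []
    for a in p:
        for b in q:
            c = a & b
            if c and c not in blocks:
                blocks.append(c)
    return blocks


def leq(p, q):
    """p <= q: every block of p is contained in some block of q."""
    return all(any(a <= b for b in q) for a in p)


def blocks(*groups):
    return [frozenset(g) for g in groups]


P_A = blocks({1}, {5}, {9}, {4}, {2, 8}, {3}, {7}, {6, 10})
P_B = blocks({1, 2, 4, 8}, {4, 5, 9}, {4}, {3, 7}, {6, 7}, {7}, {10})
P_F = blocks({9, 10}, {3, 4, 6}, {5}, {5, 8}, {1, 7}, {2, 8})
PI_G = blocks({1, 2, 4, 5, 6, 7, 8, 9}, {3, 7, 10})

condition_1 = leq(P_B, PI_G)                 # P(B) <= Pi_G
condition_2 = leq(product(P_A, PI_G), P_F)   # P(A) . Pi_G <= P_F
```

Without the information supplied by $\Pi_G$ the second condition fails, because
$P(A)$ alone does not separate cubes 6 and 10.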



If $k$ denotes the number of blocks in $\Pi_G$, then the number of outputs from
block $G$ is $p = \lceil \log_2 k \rceil$. The outputs of $G$ constitute a part
of the input support for $H$. Thus, the size of $G$ and the size of $H$ both
grow with the number of blocks in partition $\Pi_G$. The number of blocks in
$\Pi_G$ strongly depends on the input support chosen for $G$. Therefore, the
decomposition's quality strongly depends on the input support chosen. Most of
the known functional decomposition algorithms use systematic input support
selection, which finds partition $\Pi_G$ for all possible input supports [5].
These algorithms are inefficient for large problem instances. Therefore, we
present below an efficient heuristic method of input selection based on
information relationships and information relationship measures [3] [4].

3 Information relationships and measures

Analysis of information relationships between various information streams is of 
primary importance for analysis and synthesis of information systems. The the- 
ory of information relationships and measures is presented in a separate paper of 
this conference [3] and extensively discussed in [4]. Below, only some information 
is recalled that is necessary for understanding of this paper. 

Information on symbols from a certain set $S$ means the ability to distinguish
certain symbols from some other symbols. An elementary information describes
the ability to distinguish a certain single symbol $s_i$ from another single
symbol $s_j$, where $s_i, s_j \in S$ and $s_i \ne s_j$. Any set of elementary
portions of information can be represented by an information relation $I$ or an
information set $IS$ defined on $S \times S$ as follows:

$I = \{(s_i, s_j) \mid s_i$ is distinguished from $s_j$ by the modeled information$\}$,

$IS = \{\{s_i, s_j\} \mid s_i$ is distinguished from $s_j$ by the modeled information$\}$.

Relationships between r-partitions can be analyzed by considering the
relationships between their corresponding information relations and sets. The
correspondence between r-partitions and $IS$ is as follows: $IS$ contains the
pairs of symbols that are not contained in any single block of the corresponding
r-partition.

For instance, for the input variable $x_2$ of the function in Table 1 the
corresponding r-partition and information set are as follows:

$P(x_2) = \{1,2,3,4,7,8;\ 4,5,6,7,9;\ 4,7,10\}$,

$IS(x_2) = \{1|5, 1|6, 1|9, 1|10, 2|5, 2|6, 2|9, 2|10, 3|5, 3|6, 3|9, 3|10,
5|8, 5|10, 6|8, 6|10, 8|9, 8|10, 9|10\}$.

The symbol $|$ in the pairs $s_i|s_j$ from $IS(x_2)$ is used to stress that the
elements $s_i$ and $s_j$ of a certain pair $\{s_i, s_j\}$ are distinguished from
each other.
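
The information set of an r-partition can be computed directly from the rule
"pairs not contained in any single block". The sketch below assumes that an
r-partition is a list of `frozenset` blocks and that the symbol set is the
union of the blocks; the function name `information_set` is ours.

```python
from itertools import combinations


def information_set(p):
    """IS(p): unordered pairs of symbols not contained in any single block."""
    symbols = sorted(set().union(*p))
    return {
        frozenset(pair)
        for pair in combinations(symbols, 2)
        if not any(set(pair) <= block for block in p)
    }


P_x2 = [frozenset({1, 2, 3, 4, 7, 8}), frozenset({4, 5, 6, 7, 9}), frozenset({4, 7, 10})]
IS_x2 = information_set(P_x2)
```

Cubes 4 and 7 occur in every block of $P(x_2)$, so no pair containing them
appears in the result; this is why $IS(x_2)$ has only 19 of the 45 possible
pairs.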

In [3] and [4], we defined the following relationships between the information
of two r-partitions $P_1$ and $P_2$:

• common information $CI(P_1, P_2) = IS(P_1) \cap IS(P_2)$,

• extra information $EI(P_1, P_2) = IS(P_2) \setminus IS(P_1)$.

For an r-partition $P$ the information quantity $IQ$: $IQ(P) = |IS(P)|$ is
defined in [3] and [4]. For two r-partitions $P_1$ and $P_2$ the following
relationship measures are defined in [3] and [4]:

• information similarity $ISIM$: $ISIM(P_1, P_2) = |CI(P_1, P_2)|$,

• information increase $IINC$: $IINC(P_1, P_2) = |EI(P_1, P_2)|$.
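
As a worked illustration, the measures can be evaluated for $P(x_2)$ and
$P(x_3)$ of Table 1. The Python sketch below assumes r-partitions stored as
lists of `frozenset` blocks; the function names mirror the measures but are
otherwise our own.

```python
from itertools import combinations


def information_set(p):
    """IS(p): unordered pairs of symbols not contained in any single block."""
    symbols = sorted(set().union(*p))
    return {
        frozenset(pair)
        for pair in combinations(symbols, 2)
        if not any(set(pair) <= block for block in p)
    }


def iq(p):
    """Information quantity IQ(P) = |IS(P)|."""
    return len(information_set(p))


def isim(p1, p2):
    """Information similarity ISIM(P1, P2) = |IS(P1) & IS(P2)|."""
    return len(information_set(p1) & information_set(p2))


def iinc(p1, p2):
    """Information increase IINC(P1, P2) = |IS(P2) - IS(P1)|."""
    return len(information_set(p2) - information_set(p1))


P_x2 = [frozenset({1, 2, 3, 4, 7, 8}), frozenset({4, 5, 6, 7, 9}), frozenset({4, 7, 10})]
P_x3 = [frozenset({1, 5, 9}), frozenset({2, 4, 8}), frozenset({3, 6, 7, 10})]
```

For these two partitions `iq(P_x2)` is 19 and `iq(P_x3)` is 33; they share 14
pairs, so `isim(P_x2, P_x3)` is 14, `iinc(P_x2, P_x3)` is 19 and
`iinc(P_x3, P_x2)` is 5.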

4 Input support selection 

Each realization of a discrete function, thus also each decomposition, must be 
able to compute information required by the output variables of the function 
from information provided by the input variables of the function. Theorem 1 
describing the conditions for serial decomposition, can be re-expressed in terms 
of the information sets in the following way: 

Theorem 2. If there exists an r-partition $\Pi_G$ on $\mathcal{F}$ such that
$IS(P(B)) \supseteq IS(\Pi_G)$, and $IS(P(A) \cdot \Pi_G) \supseteq IS(P_F)$,
then $F$ has a serial decomposition with respect to $(A, B)$.
In general, the information that is necessary for computing the values of a
certain output is distributed over a number of inputs, the inputs also deliver
some information that is not needed for the output, and information on the
inputs is represented differently than on the considered output. Moreover, a
decomposition is nontrivial if the number of outputs of block $G$ is smaller
than the number of its inputs.

Block $G$ can be described as a block where an intermediate information
transformation is performed. This transformation consists of the construction of
an appropriate partition $\Pi_G$ from a selected partition $P(B)$. Partition
$\Pi_G$ should carry the part of the information delivered by partition $P(B)$
which, in combination with the information delivered by partition $P(A)$, is
essential to compute the required output information. The number of outputs of
block $G$ is in practical cases much smaller than the number of its inputs.
Partition $\Pi_G$ is created by merging the blocks of partition $P(B)$. To avoid
a big loss of information in this process, or the creation of partitions $\Pi_G$
with a big number of blocks, which would require many physical logic blocks in
implementation, the partitions generated by the input variables from set $B$
should be similar to each other. Set $B$ should also contain variables that
carry relatively much unimportant information. A part of the important
information delivered by variables from set $B$ is in most cases also delivered
by some variables from set $A$. Therefore it is not necessary to transfer this
information to the output of $G$.

Based on these observations we developed the following rules of input
variable selection for set $B$. Set $B$ should contain variables which:

• carry relatively much information that is unnecessary for computing the
output information,

• carry relatively much information delivered also by variables from set $A$,

• carry relatively much common information.

The above rules are expressed below using the information relationship
measures introduced earlier in this paper. Variable $x_i$ should be included
into set $B$ if:

• $EI(P_F, P(x_i))$ is relatively big, i.e. $IINC(P_F, P(x_i))$ is relatively high,

• $CI(P(A), P(x_i))$ is relatively big, i.e. $ISIM(P(A), P(x_i))$ is relatively high,

• $CI(P(B'), P(x_i))$ is relatively big, i.e. $ISIM(P(B'), P(x_i))$ is relatively high,

where $x_i$ is a candidate for set $B$ and $B'$ is the partially created set of
indirect variables.

Using these rules, we developed and implemented an algorithm for near optimal
input support selection in serial decomposition. In the algorithm, set $B$ is
constructed step by step. First, a pair of variables $(x_i, x_j)$ is chosen that
maximizes $ISIM(P(x_i), P(x_j))$. Next, those variables are added to $B$ for
which $IINC(P_F, P(x_i))$ and $ISIM(P(A), P(x_i))$ are as high as possible. This
increases the chance of constructing a partition $\Pi_G$ with a small number of
blocks without loss of substantial information. The algorithm performs a beam
search, which makes it possible to control both the quality of the results and
the computation time.
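
The construction of set $B$ can be sketched as follows. This is a simplified
greedy variant (a beam of width one), not the beam search of the paper, and the
way the three measures are combined (an unweighted sum) is an assumption made
only for illustration. All function names are ours.

```python
from functools import reduce
from itertools import combinations


def product(p, q):
    """Product of two r-partitions: all distinct non-empty block intersections."""
    out = []
    for a in p:
        for b in q:
            c = a & b
            if c and c not in out:
                out.append(c)
    return out


def information_set(p):
    symbols = sorted(set().union(*p))
    return {frozenset(s) for s in combinations(symbols, 2)
            if not any(set(s) <= block for block in p)}


def isim(p1, p2):
    return len(information_set(p1) & information_set(p2))


def iinc(p1, p2):
    return len(information_set(p2) - information_set(p1))


def select_support(partitions, p_f, size):
    """Greedily build B with 2 <= size <= number of variables."""
    names = sorted(partitions)
    # Seed: the pair of variables with maximal information similarity.
    b = list(max(combinations(names, 2),
                 key=lambda pair: isim(partitions[pair[0]], partitions[pair[1]])))
    while len(b) < size:
        rest = [x for x in names if x not in b]
        p_b = reduce(product, (partitions[y] for y in b))

        def score(x):
            # Assumed combination of the three rules: an unweighted sum.
            s = iinc(p_f, partitions[x]) + isim(p_b, partitions[x])
            a = [partitions[y] for y in rest if y != x]
            if a:  # P(A) is the product over the variables left in A
                s += isim(reduce(product, a), partitions[x])
            return s

        b.append(max(rest, key=score))
    return b


def blocks(*groups):
    return [frozenset(g) for g in groups]


# r-partitions of Table 1.
PARTS = {
    "x1": blocks({1, 2, 4, 5, 8, 9}, {3, 6, 7}, {10}),
    "x2": blocks({1, 2, 3, 4, 7, 8}, {4, 5, 6, 7, 9}, {4, 7, 10}),
    "x3": blocks({1, 5, 9}, {2, 4, 8}, {3, 6, 7, 10}),
    "x4": blocks({1, 3, 4}, {2, 8}, {5, 7}, {6, 9, 10}),
}
P_F = blocks({9, 10}, {3, 4, 6}, {5}, {5, 8}, {1, 7}, {2, 8})

support = select_support(PARTS, P_F, 3)
```

A wider beam would keep several partial sets $B'$ at each step instead of one,
trading computation time for the quality of the final support.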

5 Experimental results 

This section compares the systematic input selection method with the proposed
heuristic selection method that is based on information relationship measures,
by applying both methods to several small, medium and large benchmarks from
the international logic synthesis benchmark set [9]. Tables 2 and 3 report
results of the input support selection for set $B$ in a single serial
decomposition step, as illustrated in Fig. 1 and described in Section 2. Table 2
compares the minimum number of blocks of partition $\Pi_G$. The results were
obtained for decompositions with 3, 4, 5, and 6 input variables in set $B$. The
method based on information measures, despite its heuristic character, produces
optimal or near optimal results.



Benchmark   Size                       Systematic method (|B|)    Heuristic method (|B|)
            inputs  outputs  cubes     3     4     5     6        3     4     5     6

128 8 12 16 32 16 32 7 9 8 10 1 5 6 7 10 12 24 12 24 10 4 6 9 11 4 9 11

sse         11      11       39        4     6     8     11       4     6     9     11
keyb        12      7        147       6     9     13    19       6     10    14    19
s1          13      11       110       5     8     13    19       6     10    15    19
plan        13      25       115       5     7     11    17       5     9     14    19
styr        14      15       140       4     6     9     13       5     7     10    13
ex1         14      24       127       4     6     8     11       4     6     8     11
kirk        16      10       304       4     4     5     6        4     5     5     6



Table 2. Comparison of the number of blocks in partition $\Pi_G$ obtained by the systematic and heuristic methods.



Table 3 compares the computation time. For large benchmarks, the systematic method is very slow in comparison to the new method based on information relationships. For functions with more than 10 input variables, the new method is many times faster. The difference in processing time between the two methods grows very fast with the size of the function. For the largest of the tested functions the heuristic method is more than 50 times faster.

Table 4 shows the results of decomposition of some benchmark functions into a network of 4-input, 1-output logic cells, obtained by repeating the serial decomposition step from Fig. 1 a number of times. The decomposition aims at a minimal number of cells. The table presents the number of logic cells in the decomposition for the systematic and the heuristic method. The results show that the heuristic character of the proposed method has almost no influence on the number of logic blocks obtained in decomposition. In two cases (misex1 and alu2) the results from the heuristic method are even better than from the systematic search. This results from the fact that in the systematic method the first solution found with the minimal number of blocks in partition $\Pi_G$ is selected, whereas in the heuristic method the selected solution is the one with the minimal number of blocks that is best from the information measures viewpoint.

Benchmark   Size                       Systematic method (|B|)        Heuristic method (|B|)
            inputs  outputs  cubes     3     4     5     6            3     4     5     6

17 33 6 9 19 42 31 109 12 17 32 58 14 24 24 91 1377 8 12 26 58

kirk        16      10       304       108   528   2125  11234        55    69    119   230

Table 3. Comparison of the computation time (in seconds) for the systematic and heuristic methods.

6 Conclusions 

The proposed heuristic method of input support selection is very efficient. The method delivers decompositions of similar quality to those obtained from the systematic method. In some cases, the results from the heuristic method are even better than from the systematic method. For the largest of the tested benchmarks, the method based on information relationships is more than 50 times faster than the systematic method.



Benchmark   Size                       Systematic   Heuristic
            inputs  outputs  cubes     method       method

z4          7       4        128       7            7
5xp1        7       10       126       20           21
misex1      8       7        18        19           18
root        8       5        71        46           47
alu2        10      3        391       116          114



Table 4. Comparison of number of logic cells in decomposition. 
These features make the proposed heuristic method very useful for decomposition-based analysis and synthesis and demonstrate the high usefulness of the information relationships and measures for the analysis and synthesis of information systems.
Abstract. In the paper we present the notion of rule complex as a promising tool to formalize social game systems. We also hope to arouse the interest of the computer science community in the application of rough-set and other current computing methods to social game theory.



1 Introduction 

Systems of rules guiding social actors in their activities and interactions, as well as social game orders, may be formalized by means of rule complexes in a uniform and clear though fairly general way^1. Rule and rule complex are key concepts for us. Our rule is akin to a default rule [3], while our rule complex is not merely a set of rules^2. Using the latter notion to formalize social game systems is absolutely novel. We develop the idea presented in [1] that social organization may be seen as a certain system of rules. Our framework essentially extends the classical von Neumann-Morgenstern game theory [2], where a game order is a finite set of predetermined rules. According to our approach, rules may be imprecise and open to innovation and modification. Game orders viewed as rule complexes are subject to transformation. Social organization and, in particular, social games form a system the complexity of which far exceeds the actual capabilities of humans to create a uniform formal model with full particulars. This, together with the difficulties of communication between researchers in computer and social sciences, discourages many from applying current computing methods in the social sciences. Our framework is meant to explain in a formal way what social game theory is about and to facilitate the application of rule-oriented techniques, e.g., rough-set methods, to certain problems in social game theory, e.g., to generate judgment complexes.

The author expresses her gratitude to Andrzej Skowron for deep and useful com- 
ments. 

^1 For simplicity, the term "rule complex" will denote both a social system of rules and its formal counterpart in our theory.

^2 Informally speaking, the notion of rule complex is related to that of a set of rules as the notion of a program with procedures is related to that of a program with instructions only.
Briefly speaking, we are mainly concerned with social rule complexes (i.e., rule complexes shared by a group or population of actors), actors' complexes (i.e., social roles of actors), and social game orders. Actors use social rule complexes, a type of collective knowledge, to constitute and regulate their interactions or game processes. Each type of social relationship has its corresponding rule-based roles derived from cultural frameworks. The roles (i.e., actors' complexes) vary because actors play different roles in social relationships and because actors often differ somewhat in the ways they have learned and developed roles through their personal histories and continuing practice. Social life is often characterized by ambiguity and the underspecification of options, i.e., game information is not only incomplete; it remains to be generated. In general, interaction situations are typically not fully specified. The social constructions of games and game processes provide the specification. Among other things, the participating actors use their social structural knowledge and previous experience with one another, possibly in similar game situations, to fill in undefined or uncertain "spaces" of action opportunities and outcomes. Given that social actors tend to acquire and develop different rule complexes, we find conflicts and struggles over rule complexes. That is, rule complexes, and their articulation in institutions and games, are historical products of interactions among social actors.

2 Rules and Rule Complexes 

meta levels^3. The main assumption is that actors organize rules in rule complexes. In consequence, we can uniformly and relatively easily investigate various sorts of rules (e.g., evaluative rules, norms, judgment rules, and action rules), complex objects consisting of rules (e.g., roles, routines, procedures, algorithms, game orders, and models of the reality and the actors), and interdependencies among them. Given a language where the object and meta levels are not separated, let us denote the set of all formulas by $FOR$, while rules (resp., rule complexes) by $r$ (resp., $C$) with sub/superscripts if necessary. Formally, by a rule we mean a ternary relation $r \subseteq \wp(FOR)^2 \times FOR$ where there exist $m, n \in \mathbb{N}$ such that for any $(X, Y, \gamma) \in r$, $card(X) = m$ and $card(Y) = n$. Where possible, rules will be written by schema, viz., $r: \frac{\alpha_1, \ldots, \alpha_m \,:\, \beta_1, \ldots, \beta_n}{\gamma}$, where $X = \{\alpha_1, \ldots, \alpha_m\}$ and $Y = \{\beta_1, \ldots, \beta_n\}$. Axiomatic rules, where $X = Y = \emptyset$, represent facts.

A rule complex is a set obtained according to the following formation rules: (1) Any finite set of rules is a rule complex; (2) If $C_1, C_2$ are rule complexes, then $C_1 \cup C_2$, $C_1 \cap C_2$, $C_1 - C_2$, and $\wp(C_1)$ are rule complexes; (3) If $C_1 \subseteq C_2$ and $C_2$ is a rule complex, then $C_1$ is a rule complex. Thus, a rule complex $C$ is a set $C = \{r_1, \ldots, r_m, C_1, \ldots, C_n\}$, where $m, n \in \mathbb{N}$, $r_1, \ldots, r_m$ are some rules and $C_1, \ldots, C_n$ are some rule complexes. All rules constituting a rule complex $C$ form its rule base, $rb(C)$. Similarly, rule complexes constituting $C$ form its complex base, $cb(C)$. The meaning of the two notions is explained by the example.

^3 In practice, people carry on both the operative and the meta levels; in part, they switch back and forth or operate simultaneously on both levels. Nevertheless, we are aware of the circularity which may occur.
Example 1. Where $r_1, r_2, r_3, r_4$ are rules, sets $C_1 = \{r_2, r_3\}$, $C_2 = \{r_4\}$, $C_3 = \ldots$, $C_4 = \{r_1, C_2, C_3\}$ are rule complexes. $rb(C_4) = \{r_1, r_2, r_3, r_4\}$ and $cb(C_4) = \{C_1, C_2, C_3\}$.
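The rule base and complex base can be illustrated with a small sketch. It assumes, in line with the values stated in Example 1, that both notions are collected recursively through nested complexes. Rules are modelled as plain strings and complexes as frozensets; the concrete content chosen for `C3` below is hypothetical and serves the illustration only.

```python
def rule_base(c):
    """All rules occurring in complex c, at any nesting depth."""
    rules = set()
    for element in c:
        if isinstance(element, frozenset):   # a nested rule complex
            rules |= rule_base(element)
        else:                                # a rule
            rules.add(element)
    return rules

def complex_base(c):
    """All rule complexes occurring in complex c, at any nesting depth."""
    complexes = set()
    for element in c:
        if isinstance(element, frozenset):
            complexes.add(element)
            complexes |= complex_base(element)
    return complexes

C1 = frozenset({"r2", "r3"})
C2 = frozenset({"r4"})
C3 = frozenset({"r3", C1})        # hypothetical content, for illustration
C4 = frozenset({"r1", C2, C3})
```

With these definitions, `rule_base(C4)` collects the four rules and `complex_base(C4)` collects `C1`, `C2` and `C3`, although `C1` is not a direct element of `C4`.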

Example 2 . Algorithms as collections of instructions may be seen as rule com- 
plexes. 

By a subcomplex of $C$ we mean an element of $cb(C)$ or a rule complex obtained from $C$ by dropping some elements of $rb(C) \cup cb(C)$. In particular, any subset of $C$ is a subcomplex of $C$ as well. However, $C_2$ and $C_6 = \{r_1, C_2, C_5\}$, where $C_5 = \{C_1\}$, are subcomplexes but not subsets of $C_4$.

The notion of rule complex may be used to formalize game (or interaction) 
orders as well as actors’ systems of rules: normative orders, value complexes, 
judgment systems, action modules, and models of the reality and the actors. 

Example 3. Given an interaction situation $S_t$ at time $t$, consider an actor acting as an employee and a mother in $S_t$. The roles are formalized by rule complexes $ROL_E(t)$ and $ROL_M(t)$, respectively. The actor's role in $S_t$, $ROL(t)$, may be defined as

$ROL(t) = \{ROL_E(t), ROL_M(t), R_R(t)\}$,

where $R_R(t)$ is a complex of extra rules describing, among others, interdependencies between $ROL_E(t)$ and $ROL_M(t)$. Now consider subcomplex $ROL_E(t)$, which describes the actor's activities as an employee in $S_t$ and, in particular, norms, values (and hence goals), judgment system(s), action modules (routines, procedures, etc.) in $S_t$, and her beliefs and knowledge about $S_t$ (i.e., about the reality, herself, and other actors involved). The norms form a rule complex, $NO_E(t)$, called a normative order associated with the role of employee in $S_t$. Similarly, the evaluative rules form a value complex, $VAL_E(t)$, the judgment rules constitute a judgment complex, $JUDG_E(t)$, the action modules form an action complex, $ACT_E(t)$, and the actor's beliefs and knowledge about $S_t$ form her model of $S_t$, $MOD_E(t)$. Hence, $ROL_E(t)$ may be written as

$ROL_E(t) = \{NO_E(t), VAL_E(t), JUDG_E(t), ACT_E(t), MOD_E(t), R_E(t)\}$,

where $R_E(t)$ is a complex of other relevant rules. Going on along these lines, one can obtain specific rules, routines, and procedures associated with subcomplexes of $ROL_E(t)$; and analogously for $ROL_M(t)$. On the other hand, if one considers particular questions like judgment making, one may investigate the actor's judgment complex in $S_t$, $JUDG(t)$, which may be defined as

$JUDG(t) = \{JUDG_E(t), JUDG_M(t), R_J(t)\}$,

where $R_J(t)$ is a complex of some extra rules taken into account. Thus, the actor's role $ROL(t)$ may also be written as

$ROL(t) = \{NO(t), VAL(t), JUDG(t), ACT(t), MOD(t), R'_R(t)\}$,

where $R'_R(t)$ is a complex of other rules relevant for $ROL(t)$.
3 Final Remarks

For lack of space, our presentation is limited to the definitions of rule and rule 
complex and to a simple example on modeling social roles by means of rule 
complexes. Nevertheless, we have investigated such problems as application of 
rules and rule complexes, consistency of rule complexes, and transformations of 
rule complexes (in particular, compositions and decompositions). 

In spite of its conceptual simplicity, the notion of rule complex is powerful 
and flexible enough to model systems of social actors, games, and interactions. 
Apart from modeling social game systems in a novel way, we aim at building a bridge between the computer and social sciences to facilitate the application of new computing ideas and techniques to the contemporary problems investigated in the social sciences.
Abstract. Analysis of relationships between information streams is of primary importance for analysis and synthesis of discrete information systems. This paper defines and discusses various information relationships and measures for the strength and importance of the information relationships.



1 Introduction 

Analysis of information and information relationships is of primary importance 
for analysis and synthesis of discrete information systems. This paper aims to 
introduce and discuss the fundamental apparatus for analysis and evaluation of 
information and information relationships. 

2 Representation of information in information systems 

Information is represented in discrete systems by values of some discrete signals or variables. Let us consider a certain finite set of elements $S$ called symbols. Knowing a certain value of a certain signal or variable $x$, it is possible to distinguish a certain subset $B$ of elements from $S$ from all other elements of $S$, but it is impossible to distinguish between the elements from $B$. For example, for the Boolean function in Fig. 1, where symbols 1-5 represent terms on input variables $x_1$-$x_6$, different values of the input variable $x_3$ enable us to distinguish between the subsets {1,2,3} and {4,5}: for 1, 2 and 3, $x_3 = 0$, and for 4 and 5, $x_3 = 1$. In such a way information is modeled with set systems [1].

A set system [1] (r-partition [4]) $SS$ on a set $S$ is defined as a collection of subsets $B_1, B_2, \ldots, B_k$ of $S$ such that $\bigcup_i B_i = S$ and $B_i \not\subseteq B_j$ for $i \neq j$.
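The two conditions of this definition can be checked mechanically. The sketch below assumes the second condition means that no block is contained in a different block; the function name `is_set_system` is hypothetical.

```python
def is_set_system(blocks, symbols):
    """Check that the blocks cover the symbols and none contains another."""
    blocks = [frozenset(b) for b in blocks]
    covers = set().union(*blocks) == set(symbols)
    # No block may be a subset of a different block.
    incomparable = all(not (a <= b)
                       for i, a in enumerate(blocks)
                       for j, b in enumerate(blocks) if i != j)
    return covers and incomparable

S = range(1, 6)
ok = is_set_system([{1, 2, 5}, {3, 4, 5}], S)       # overlapping blocks are allowed
nested = is_set_system([{1, 2}, {1, 2, 3}], {1, 2, 3})
```

Unlike a partition, a set system may have overlapping blocks, as in the first call, where symbol 5 belongs to both blocks.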

term       inputs                            outputs
symbols    x1   x2   x3   x4   x5   x6       y1   y2   y3   y4

1          0    -    0    0    0    1        0    1    1    0
2          0    0    0    1    0    -        1    1    -    0
3          1    -    0    0    1    -        1    1    1    0
4          1    1    1    1    1    -        0    -    0    1
5          -    0    1    -    0    0        -    0    -    0

Fig. 1. A multi-output Boolean function $y = f(x_1, x_2, x_3, x_4, x_5, x_6)$.

An elementary information describes the ability to distinguish a certain single symbol $s_i$ from another single symbol $s_j$, where $s_i, s_j \in S$ and $s_i \neq s_j$. Any set of elementary portions of information can be represented by an information relation $I$, an information set $IS$ and an information graph $IG$ defined on $S \times S$ as follows:

$I = \{(s_i, s_j) \mid s_i \text{ is distinguished from } s_j \text{ by the modeled information}\}$,

$IS = \{\{s_i, s_j\} \mid s_i \text{ is distinguished from } s_j \text{ by the modeled information}\}$, and

$IG = (S, \{\{s_i, s_j\} \mid s_i \text{ is distinguished from } s_j \text{ by the modeled information}\})$.

Relationships between set systems can be analyzed by considering relationships between their corresponding information relations, sets and graphs. The correspondence between $SS$ and $IS$ is as follows: $IS$ contains the pairs of symbols that are not contained in any single block of the corresponding $SS$.

Example 1. (information modeling with set systems and information sets)

For the function from Fig. 1, the appropriate set systems and information sets are listed below:

$SS(x_1) = \{\{1,2,5\}; \{3,4,5\}\}$   $IS(x_1) = \{1|3, 1|4, 2|3, 2|4\}$

$SS(x_2) = \{\{1,2,3,5\}; \{1,3,4\}\}$   $IS(x_2) = \{1|4, 2|4, 4|5\}$

$SS(x_3) = \{\{1,2,3\}; \{4,5\}\}$   $IS(x_3) = \{1|4, 1|5, 2|4, 2|5, 3|4, 3|5\}$

$SS(x_4) = \{\{1,3,5\}; \{2,4,5\}\}$   $IS(x_4) = \{1|2, 1|4, 2|3, 3|4\}$

$SS(x_5) = \{\{1,2,5\}; \{3,4\}\}$   $IS(x_5) = \{1|3, 1|4, 2|3, 2|4, 3|5, 4|5\}$

$SS(x_6) = \{\{1,2,3,4\}; \{2,3,4,5\}\}$   $IS(x_6) = \{1|5\}$

$SS(y_1) = \{\{1,4,5\}; \{2,3,5\}\}$   $IS(y_1) = \{1|2, 1|3, 2|4, 3|4\}$

$SS(y_2) = \{\{1,2,3,4\}; \{4,5\}\}$   $IS(y_2) = \{1|5, 2|5, 3|5\}$

$SS(y_3) = \{\{1,2,3,5\}; \{2,4,5\}\}$   $IS(y_3) = \{1|4, 3|4\}$

$SS(y_4) = \{\{1,2,3,5\}; \{4\}\}$   $IS(y_4) = \{1|4, 2|4, 3|4, 4|5\}$

$SS(y) = SS(y_1, y_2, y_3, y_4) = \{\{1\}; \{2,3\}; \{4\}; \{5\}\}$,

$IS(y) = \{1|2, 1|3, 1|4, 1|5, 2|4, 2|5, 3|4, 3|5, 4|5\}$
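The correspondence between a set system and its information set can be sketched in a few lines. The function name `information_set` is hypothetical; a pair `a|b` is represented as a two-element frozenset.

```python
from itertools import combinations

def information_set(blocks, symbols):
    """Pairs of symbols that are not contained together in any single block."""
    return {frozenset(pair)
            for pair in combinations(sorted(symbols), 2)
            if not any(set(pair) <= set(block) for block in blocks)}

S = range(1, 6)
IS_x3 = information_set([{1, 2, 3}, {4, 5}], S)
IS_y = information_set([{1}, {2, 3}, {4}, {5}], S)
```

For $SS(x_3)$ this yields the six pairs that join a symbol from {1,2,3} with a symbol from {4,5}, and for $SS(y)$ it yields every pair except 2|3.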



3 Information relationships 

Information relationships between two set systems $SS_1$ and $SS_2$ are defined as follows:

• common information CI (information that is present in both $SS_1$ and $SS_2$): $CI(SS_1, SS_2) = IS(SS_1) \cap IS(SS_2)$

• total (combined) information TI (information that is present either in $SS_1$ or in $SS_2$): $TI(SS_1, SS_2) = IS(SS_1) \cup IS(SS_2)$

• missing information MI (information that is present in $SS_1$, but missing in $SS_2$): $MI(SS_1, SS_2) = IS(SS_1) \setminus IS(SS_2)$

• extra information EI (information that is missing in $SS_1$, but present in $SS_2$): $EI(SS_1, SS_2) = IS(SS_2) \setminus IS(SS_1)$

• different information DI (information present in one and missing in the other set system): $DI(SS_1, SS_2) = MI(SS_1, SS_2) \cup EI(SS_1, SS_2)$.

Example 2. (some information relationships for the function from Fig. 1)

$CI(SS(y_1), SS(x_4)) = \{1|2, 3|4\}$   $MI(SS(y_1), SS(x_4)) = \{1|3, 2|4\}$

$EI(SS(y_1), SS(x_4)) = \{1|4, 2|3\}$   $DI(SS(y_1), SS(x_4)) = \{1|3, 1|4, 2|3, 2|4\}$

$TI(SS(y_1), SS(x_4)) = \{1|2, 1|3, 1|4, 2|3, 2|4, 3|4\}$
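Since the relationships are defined by set operations on information sets, the values of Example 2 can be reproduced directly. In this sketch the helper `pairs` and the function `relationships` are hypothetical names; the information sets of $y_1$ and $x_4$ are taken from Example 1.

```python
def pairs(text):
    """Parse a string such as '1|2 3|4' into a set of unordered pairs."""
    return {frozenset(int(s) for s in item.split("|")) for item in text.split()}

def relationships(is1, is2):
    """Information relationships between two information sets."""
    return {
        "CI": is1 & is2,   # common information
        "TI": is1 | is2,   # total information
        "MI": is1 - is2,   # present in the first, missing in the second
        "EI": is2 - is1,   # missing in the first, present in the second
        "DI": is1 ^ is2,   # MI united with EI (symmetric difference)
    }

IS_y1 = pairs("1|2 1|3 2|4 3|4")
IS_x4 = pairs("1|2 1|4 2|3 3|4")
rel = relationships(IS_y1, IS_x4)
```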
With the relationship apparatus defined above, we can answer questions such as: what information required to compute values of a certain variable is present in another variable, what information is missing, etc. However, to take appropriate decisions, some quantitative relationship measures are often necessary.

4 Information relationship measures 

For a set system $SS$, the information quantity IQ is defined as $IQ(SS) = |IS(SS)|$. For two set systems $SS_1$ and $SS_2$, the following information relationship measures are defined:

• similarity measure ISIM: $ISIM(SS_1, SS_2) = |CI(SS_1, SS_2)|$

• dissimilarity measure IDIS: $IDIS(SS_1, SS_2) = |DI(SS_1, SS_2)|$

• decrease measure IDEC: $IDEC(SS_1, SS_2) = |MI(SS_1, SS_2)|$

• increase measure IINC: $IINC(SS_1, SS_2) = |EI(SS_1, SS_2)|$

• total information quantity TIQ: $TIQ(SS_1, SS_2) = |TI(SS_1, SS_2)|$.

It is also possible to define some relative measures, by normalizing the above absolute measures, and weighted measures, by associating a certain importance weight $w(s_i|s_j)$ with each elementary information.
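The absolute measures are cardinalities of the corresponding relationships, so they can be computed as below. The weighted variant is only one possible reading of the last sentence: it sums a user-supplied weight over the common information, and the weight function used here is hypothetical.

```python
def pairs(text):
    """Parse a string such as '1|2 3|4' into a set of unordered pairs."""
    return {frozenset(int(s) for s in item.split("|")) for item in text.split()}

def measures(is1, is2):
    """Absolute information relationship measures for two information sets."""
    return {
        "ISIM": len(is1 & is2),
        "IDIS": len(is1 ^ is2),
        "IDEC": len(is1 - is2),
        "IINC": len(is2 - is1),
        "TIQ": len(is1 | is2),
    }

def weighted_isim(is1, is2, weight):
    """Assumed weighted similarity: sum of weights over common information."""
    return sum(weight(pair) for pair in is1 & is2)

IS_y1 = pairs("1|2 1|3 2|4 3|4")
IS_x4 = pairs("1|2 1|4 2|3 3|4")
abs_measures = measures(IS_y1, IS_x4)
# Hypothetical weighting: pairs involving symbol 1 count double.
w = weighted_isim(IS_y1, IS_x4, lambda pair: 2.0 if 1 in pair else 1.0)
```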

5 Application of the relationships and measures 

The information relationships and measures introduced above are of primary 
importance for an effective and efficient analysis and synthesis of information 
systems. They provide designers and tools with data necessary for decision making. Results of the relationship analysis make it possible to discover the nature of a considered system or to decide its structure. Below, we illustrate with an example how the information relationships and measures are used.

Example 3. (application to input support minimization) Input support minimization consists of finding a minimal subset of inputs that still enables unambiguous computation of values for a specified function [2]. Let us find the minimum support for the example function from Fig. 1 by analyzing the relationships between the information required for computing the output values and the information provided by particular input variables.

$CI(y, x_1) = \{1|3, 1|4, 2|4\}$   $|CI(y, x_1)| = 3$

$CI(y, x_2) = \{1|4, 2|4, 4|5\}$   $|CI(y, x_2)| = 3$

$CI(y, x_3) = \{1|4, 1|5, 2|4, 2|5, 3|4, 3|5\}$   $|CI(y, x_3)| = 6$

$CI(y, x_4) = \{1|2, 1|4, 3|4\}$   $|CI(y, x_4)| = 3$

$CI(y, x_5) = \{1|3, 1|4, 2|4, 3|5, 4|5\}$   $|CI(y, x_5)| = 5$

$CI(y, x_6) = \{1|5\}$   $|CI(y, x_6)| = 1$

$|CI(y, x_3)| > |CI(y, x_5)| > |CI(y, x_1)| = |CI(y, x_2)| = |CI(y, x_4)| > |CI(y, x_6)|$.

Thus, $x_3$ delivers more information for $y$ than $x_5$, $x_5$ more than $x_1$, etc. Furthermore, $x_3$ delivers unique information, namely 2|5. Therefore, $x_3$ must be in any minimal support. The symbols from the pairs given by $IS(y) - CI(y, x_3) = \{1|2, 1|3, 4|5\}$ are not distinguished from each other by $x_3$, and therefore at least one extra input variable is necessary. Because $CI((IS(y) - CI(y, x_3)), x_1) = \{1|3\}$, $CI((IS(y) - CI(y, x_3)), x_2) = \{4|5\}$, $CI((IS(y) - CI(y, x_3)), x_4) = \{1|2\}$, $CI((IS(y) - CI(y, x_3)), x_5) = \{1|3, 4|5\}$, $CI((IS(y) - CI(y, x_3)), x_6) = \emptyset$, no single variable delivers the lacking information, but $x_4$ and $x_5$ together provide it. This means that $x_3$, $x_4$ and $x_5$ constitute the minimum input support.
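The result of this example can be cross-checked by brute force. The sketch below is not the analysis procedure described in the text: it simply enumerates subsets of inputs in order of increasing size and keeps those whose combined information sets cover $IS(y)$. The information sets are copied from Example 1, and `minimal_supports` is a hypothetical name.

```python
from itertools import combinations

def pairs(text):
    """Parse a string such as '1|2 3|4' into a set of unordered pairs."""
    return {frozenset(int(s) for s in item.split("|")) for item in text.split()}

IS = {
    "x1": pairs("1|3 1|4 2|3 2|4"),
    "x2": pairs("1|4 2|4 4|5"),
    "x3": pairs("1|4 1|5 2|4 2|5 3|4 3|5"),
    "x4": pairs("1|2 1|4 2|3 3|4"),
    "x5": pairs("1|3 1|4 2|3 2|4 3|5 4|5"),
    "x6": pairs("1|5"),
}
IS_y = pairs("1|2 1|3 1|4 1|5 2|4 2|5 3|4 3|5 4|5")

def minimal_supports(inputs, required):
    """All smallest input subsets whose information covers `required`."""
    names = sorted(inputs)
    for size in range(1, len(names) + 1):
        found = [combo for combo in combinations(names, size)
                 if required <= set().union(*(inputs[n] for n in combo))]
        if found:
            return found
    return []

supports = minimal_supports(IS, IS_y)
```

The exhaustive search grows exponentially with the number of inputs, which is why an analysis guided by the relationships and measures is preferable for larger functions.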

6 Conclusions 

Some of the relationships and measures have already been applied in CAD tools 
for finding optimal decompositions of combinational functions [6] , and for finding 
minimal input support [2]. Application of the relationships and measures to 
input support selection in functional decomposition is presented in a separate 
paper of this conference [5]. Their more extensive discussion can be found in 
[3]. The theory of information relationships and measures introduced briefly in 
this paper makes operational the famous theory of set systems of Hartmanis 
[1]. Set systems enable us to model information. The relationships and measures 
enable us to analyze and measure the modeled information and the relationships 
between the modeled information streams. They form a fundamental apparatus for analysis and design of information systems, and in particular for logic design, database design, pattern analysis, machine learning etc. [1]-[6].
Abstract. There are many different ways of discovering knowledge in large databases, but in all of them the same problem arises: how rigorously do the results of discovery answer the purpose? In the paper we discuss some sources of vagueness in data mining, and present a non-statistical method of inaccuracy evaluation.



1 Introduction 

The question of how rigorously the results of a work answer its purpose is common to all kinds of activity. It is particularly important in knowledge discovery from large databases, because requirements in this domain can be diverse, and limitations on algorithm efficiency are frequently accepted.

It will be assumed that the process of discovery starts from a user query, which contains some information on the domain of search and a more or less precise specification of the requested concept. A simple query to a database may look as follows:

What are the factors that lead to concept(s) in domain?

(for instance concept = study prolongation, domain = my department)

Inductive learning is essential for generating hypotheses from data automatically, and therefore our interest will be focused on logical descriptions of hypotheses obtained by induction.



2 Basic notions 

A relational database collects attribute values of a set of objects. Let $X = \{x_1, x_2, \ldots, x_M\}$ be the set of objects in the problem domain, described by the set $A = \{a_1, a_2, \ldots, a_N\}$ of attributes. By $V_i$ we denote the domain (value set) of the attribute $a_i$, i.e. $V_i = \{\ldots\}$. An equation $a_i = v_{ji}$ is called a selector.

The possible values space $V' = V_1 \times V_2 \times \ldots \times V_N$ has elements $\mathbf{v}$ called characteristics or tuples. Each object $x_m$ has a characteristic $\mathbf{v}_m$. The real values space is defined by $V = \bigcup_{m=1}^{M} \mathbf{v}_m$. The data matrix $\mathbf{M} = (\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M)^T$ has rows determined by objects and columns corresponding to attributes.

Attribute values $V_i$ (original values) may have different domains: continuous ($V_i \subset \mathbb{R}$), ordinal or nominal. These domains are frequently modified in order to reduce their cardinality or to adapt attribute values to external requirements. Some of the more frequently used attribute modifications are listed below.



L. Polkowski and A. Skowron (Eds.): RSCTC’98, LNAI 1424, pp. 589-592, 1998. 
@ Springer-Verlag Berlin Heidelberg 1998 




590 W. Traczyk 



Discretization (coding) of continuous values produces discrete values;
Partitioning merges some values of an ordinal attribute into intervals, introducing a new form of selectors, e.g. $a \in \ldots$ or $a > v$;

Assembling selects (by or and not) several discrete values;

Generalization joins similar discrete values into one group.

Since selectors with modified attributes can include different relations and simple values or intervals, we will generally use the symbol $s_i$ for all kinds of selectors with an attribute $a_i$.

Two or more original or modified attributes (from the same or distinct tables) can be used for the construction of a virtual attribute $a^*$, with a domain determined by operations on the domains of the component attributes. In the case of logical dependence each value $v$ of $a^*$ is defined by a conjunction of selectors (with original or modified attributes $a_i, a_j, \ldots$ and relations $=, <, \ldots$):

$(a^* = v)$ if $(a_i = v_i)$ AND $(a_j = v_j)$ AND $\ldots$

We will also use the equivalent notation

$Def(a^* = v) = Def(s^*) = \{s_i, s_j, \ldots\}$

It will be useful to extend the definition and assume that a virtual attribute is a generalization, covering one or more original or modified attributes of the given data matrix.

A concept name, introduced in a user query, can be related to the database if there exists a virtual attribute with a value that strictly matches the meaning of the concept. We will use the symbols $\sigma$ for a concept name, $v^\sigma$ for the corresponding attribute value and $a^\sigma$ for the virtual attribute with this value ($a^\sigma = v^\sigma$).

The intension $Int(\sigma)$ of a concept $\sigma$ is the definition of the appropriate virtual attribute and may be used as a definition of the concept (for logical dependencies):
$Int(\sigma) = Def(a^\sigma = v^\sigma)$

The extension $Ext(\sigma)$ (or range $X(\sigma)$) of a concept $\sigma$ contains all objects from $X$ for which the attribute $a^\sigma$ has value $v^\sigma$: $Ext(\sigma) = X(\sigma) = \{x \mid a^\sigma(x) = v^\sigma\}$

Attributes from the intension of a concept define it, but frequently some other attributes describe the causes or conditions of the appearance of the concept. Then, for a given concept $\sigma$ the set of all original or modified attributes can be divided into three parts: defining attributes $A_D$ (represented by $a^\sigma$), explaining (interpretative) attributes $A_E$ and irrelevant attributes $A_I$, such that $A = A_D \cup A_E \cup A_I$. For example, the concept study prolongation is defined by the attribute SEMESTER-OF-THE-STUDY (modified by coding "long" study) and explained by the attributes EXTERNAL-JOB, AVERAGE-GRADE, ..., but attributes like NAME, ID-NUMBER are irrelevant.

Defining, explaining and irrelevant attributes determine an appropriate partition of the object characteristic: $\mathbf{v}_m = (\mathbf{v}_{Dm}, \mathbf{v}_{Em}, \mathbf{v}_{Im})$; we will call them the D-, E- and I-characteristics of an object $x_m$. Let

$V_D = \bigcup_{m=1}^{M} \mathbf{v}_{Dm}$, $V_E = \bigcup_{m=1}^{M} \mathbf{v}_{Em}$ and $\rho: X \to V_E$.

All objects from $X$ represented by the E-characteristic $\mathbf{v}_E$ are then defined by
$X(\mathbf{v}_E) = \{x \mid \rho(x) = \mathbf{v}_E\}$

In many cases the main goal of data mining is to find the dependency between defining attributes (as dependent variables) and explaining or remaining attributes (as independent variables). The explanation of a concept $\sigma$, represented by a virtual attribute $a^* = v$ with a discrete value, can be stated as a predicate formula $\Phi$, with selectors used as arguments ($a_k, a_l, \ldots \in A_E$):

$Expl(a^* = v) = Expl(s^*) = \Phi(s_k, s_l, \ldots)$

This equation is usually presented as a production rule. Explanations generated in this way designate the answer to the user query, but frequently a query concerns not only one concept but a set of similar concepts, exhaustive and disjoint, which form classes of the domain:

$X = X(\sigma_1) \cup X(\sigma_2) \cup \ldots \cup X(\sigma_K)$

In this case explanation formulae are used as a tool for classification.

3 Approximations 

In the process of generating a reply to the user query we should be able to evaluate the quality of the result, mainly determined by univocal and precise final explanations or interpretations. A precise explanation (or interpretation) of concept(s) by means of a logical formula assumes a univocal relationship between values of explaining attributes and values of defining attributes. This is equivalent to the assumption that the set of objects relevant to $\sigma$ (i.e. $X(\sigma)$) covers all objects represented by E-characteristics explaining the concept $\sigma$ (i.e. $X(\mathbf{v}_E)$). The appropriate formula (valid for all $\sigma$) looks as follows:

$(X(\mathbf{v}_E) \cap X(\sigma) \neq \emptyset) \Rightarrow (X(\mathbf{v}_E) \subseteq X(\sigma))$.

If reality is not so ideal, some characteristics $\mathbf{v}_E$ refer not only to the considered concept but also to other notions (then the explanation of the concept is not correct), or objects with equal characteristics $\mathbf{v}_E$ are associated with different concepts (giving a wrong classification if based on $\mathbf{v}_E$). If we are interested in the degree of ambiguity, rough sets [3] can help to evaluate it.

E-characteristics that always properly explain the concept define the set of objects being the lower approximation of the concept (more exactly, of its extension), calculated as:

$\underline{X}_\sigma = \{x_m \mid X(\mathbf{v}_{Em}) \subseteq X(\sigma)\}$

E-characteristics that explain the concept, but can also affect some other notions, define the set of objects being the upper approximation of the concept:

$\overline{X}_\sigma = \{x_m \mid X(\mathbf{v}_{Em}) \cap X(\sigma) \neq \emptyset\}$

A quality of interpretation may now be defined as the ratio of univocally interpreted objects to all objects referred to the concept:

$\eta = |\underline{X}_\sigma| / |\overline{X}_\sigma|$

In the ideal case $\underline{X}_\sigma = \overline{X}_\sigma = X(\sigma)$ and $\eta = 1$.
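The two approximations can be computed by grouping objects that share an E-characteristic. The sketch below uses hypothetical toy data and assumes that the quality of interpretation is the ratio of the sizes of the lower and upper approximations; the function name `approximations` is likewise an assumption.

```python
from collections import defaultdict

def approximations(e_char, concept):
    """Lower and upper approximation of a concept extension.

    e_char maps each object to its E-characteristic (a hashable tuple);
    concept is the set of objects X(sigma).
    """
    classes = defaultdict(set)
    for obj, v in e_char.items():
        classes[v].add(obj)          # objects sharing one E-characteristic
    lower, upper = set(), set()
    for members in classes.values():
        if members <= concept:       # characteristic always explains the concept
            lower |= members
        if members & concept:        # characteristic explains it at least once
            upper |= members
    return lower, upper

# Hypothetical data: (external job, average grade) per student.
e_char = {1: ("job", "low"), 2: ("job", "low"), 3: ("job", "high"),
          4: ("nojob", "high"), 5: ("nojob", "high"), 6: ("job", "high")}
prolonged = {1, 2, 3}                # extension of the concept
lower, upper = approximations(e_char, prolonged)
quality = len(lower) / len(upper)    # assumed form of the quality ratio
```

Here objects 3 and 6 share a characteristic but only object 3 belongs to the concept, so both fall into the upper approximation only.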
For complex explanations it is important to know that some set of E-characteristics marks a certain but incomplete explanation:

$\underline{V}_\sigma = \{\mathbf{v}_E \mid X(\mathbf{v}_E) \subseteq X(\sigma)\}$

and another set defines a redundant explanation:

$\overline{V}_\sigma = \{\mathbf{v}_E \mid X(\mathbf{v}_E) \cap X(\sigma) \neq \emptyset\}$

When more than one concept should be interpreted by data mining, the evaluation of interpretation can be done by means of evidence theory. In this theory, the basic probability assignment $m: 2^\Theta \to [0, 1]$ is interpreted as the degree of evidence that a specific element of $X$ belongs to the set of concepts $\theta$ (a subset of all concepts considered), but not to any special subset of $\theta$.

Notions of lower and upper approximations can be extended to more than one concept in the set $\theta$:

Xe=[^X^ = U 

a^9 a^O 

Having sets $X_\theta$ one can easily calculate $m(\theta)$ from the equation

m{0) = ^ 

If $\theta$ contains one, two or more concepts and $A$ is a set of concepts, then the belief function is defined as

$Bel(A) = \sum_{\theta \subseteq A} m(\theta)$

and the plausibility function as

$Pl(A) = \sum_{\theta \cap A \neq \emptyset} m(\theta)$

$Bel(A)$ represents the total evidence or belief that an object from $X$ belongs to
$A$ as well as to the various subsets of $A$. $Pl(A)$ represents all this plus the additional
evidence associated with sets that overlap with $A$.

The pair $(Bel(A), Pl(A))$ for different $A$ constitutes a measure of approximation
for the set $A$: if $Bel(A) = Pl(A)$, the concepts in $A$ are interpreted precisely
(as a set); the difference between $Bel(A)$ and $Pl(A)$ shows the level of approximation;
and the values of $Bel(A)$ and $Pl(A)$ qualify the contribution of the members of $A$ to
the total set of objects.
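
A minimal sketch of how belief and plausibility follow from a basic probability assignment is given below. It uses the standard evidence-theory sums over the focal sets; the mass values and the concept names "a" and "b" are invented for illustration and are not results from the paper.

```python
def belief(m, A):
    """Sum of the masses of all focal sets contained in A."""
    return sum(mass for theta, mass in m.items() if theta <= A)


def plausibility(m, A):
    """Sum of the masses of all focal sets that overlap A."""
    return sum(mass for theta, mass in m.items() if theta & A)


# Assumed basic probability assignment over subsets of the concepts {a, b}.
m = {
    frozenset({"a"}): 0.5,        # objects interpreted as concept a only
    frozenset({"b"}): 0.2,        # objects interpreted as concept b only
    frozenset({"a", "b"}): 0.3,   # objects that cannot be told apart
}

A = {"a"}
print(belief(m, A), plausibility(m, A))
```

The gap between the two printed values equals the mass of the ambiguous set {a, b}, which is the level of approximation described above.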

Abstract. We describe how probabilistic rough classifiers,
generated by the rule induction system ProbRough, were used 
for purchase prediction and discovering knowledge on customer 
behavior patterns. The decision rules were induced from the 
mail-order company database. Construction of ProbRough is 
based on the idea of the attribute space partition and was in- 
spired by the rough set theory. The system’s beam search strat- 
egy in a space of models is guided by the global cost criterion. 
The system accepts noisy and inconsistent data with missing 
attribute values. Background knowledge is used in the form of 
prior probabilities of decisions and different costs of misclassi- 
fication. ProbRough provided a lot of useful information about 
the problem of customer response modeling, and demonstrated 
its usefulness and efficiency as a data mining tool. 



Introduction 

Predicting future purchase behavior at the level of the individual consumer has 
become a key issue in database marketing, which is defined as a method of an- 
alyzing customer data to look for patterns among existing customer preferences 
and to use these patterns for more targeted selection of customers (Fayyad et
al., 1996). The prediction and targeting of the individual consumer was made
possible by the capability to individually address every single customer by direct
marketing media such as direct mail, catalogs, and more recently the internet
(Petrison et al., 1993). This is in contrast to the traditional advertising media
such as print and television, which do not offer this opportunity. In the paper we 
describe how probabilistic rough classifiers, generated by the ProbRough system 
(Piasta and Lenarcik, 1996, 1998) were used for purchase prediction, for discovering
knowledge on customer behavior patterns, and for presenting this knowledge
in a form transparent to the user. ProbRough is a system based on the idea
of the attribute space partition. Its construction was inspired by the rough set
theory (Pawlak, 1991). The ProbRough beam search strategy in a space of models
is guided by the global cost criterion. The system accepts noisy and inconsis- 
tent data with missing attribute values. Background knowledge is used in the 
form of prior probabilities of decisions and different costs of misclassification. 
ProbRough provided a lot of useful information about the problem of customer 
response modeling, and demonstrated its usefulness and efficiency as a data mining
tool (see also Kowalczyk and Piasta, 1998; Lenarcik and Piasta, 1997). This
paper is structured as follows: in the next section we describe the process of 
knowledge discovery, including a detailed description of the business problem, 
the data mining method, characterization of the data and a discussion of the 
results. We conclude with a summary of our findings. 



The KDD process 

Business problem 

Every mailing period, mail-order companies are confronted with the decision 
problem whether or not they have to mail a catalog to a particular customer. 
The importance of this task is enhanced by the tendency of rising mailing costs 
and increasing competition (Hauser, 1992). In this case study, we formulated this 
decision problem as a classification task, i.e., we tried to predict, based on all 
available data, whether a customer would (re)purchase during the next mailing
period. Hence, the response variable in the specification which we investigated 
was binary (0/1). The marketing manager would then typically use the prediction 
of the classification technique to rank the total customer base and mail to the 
best part of its customer base. How many customers would receive a particular 
catalog is determined by the mailing company based on the cost/profit trade-off
or budget constraints. Two important issues in this decision problem arose,
which have not been dealt with extensively by previous discussions of this topic 
of response modeling: (1) the prior probabilities and (2) the misclassification 
costs. The former issue is concerned with the fact that frequently the proportion
of buyers to non-buyers in the data does not reflect the true proportion in the
whole population of customers. The latter problem deals with the fact that the 
cost of incorrectly assigning a buyer as a non-buyer (i.e., the opportunity cost 
of missing a sale) is much higher than the situation in which a non-buyer is 
predicted to be a buyer (i.e., the cost of sending a catalog to somebody who will 
not purchase). 



Data Mining Method: ProbRough 

We selected the ProbRough system for generating probabilistic rough classifiers 
as a data mining tool. ProbRough has the ability to handle the key features of the 
problem at hand: (1) prior probabilities and (2) unequal misclassification costs.
In the sequel, by standard priors and costs we mean the prior probabilities set to
the frequencies of classes in the training data and equal costs of misclassification.
We now present the general idea of ProbRough. A detailed description of the
algorithm is given in Piasta and Lenarcik (1996).

Domains of the decision rules generated by ProbRough are of the form

$A = A_1 \times A_2 \times \dots \times A_m$,

where $A_q$ is an interval when the values of the $q$-th attribute are ordered, or an
arbitrary subset otherwise. Such $A_q$ are referred to as segments. The sets
$A$ are called the feasible subsets in the attribute space. Because of the specific
form of the domains, the decision rules can be expressed in a transparent format 
that can be easily understood by users. 

The algorithm of rough classifier generation consists of the two fundamental 
phases: 

— [(i)] the global segmentation of the attribute space, 

— [(ii)] the reduction of the number of decision rules. 

In the first phase ProbRough tries to minimize the average global cost of the 
decision-making. In this phase a number of divisions of the whole attribute space
are performed. This number is one of the parameters and is referred to as the number
of iterations of the algorithm. The number of iterations can be optimized
in order to improve the predictive properties of the algorithm (see, Piasta and 
Lenarcik, 1996). Each single iteration consists in dividing one of the attribute 
value sets (or its subset) into two segments. When the attribute is continuous 
then a finite set of intermediate values has to be used. This set should be given 
in advance or can be obtained from the data. The attributes which are not in- 
volved in the partition process are eliminated. The resulting partitions determine 
the unique partition of the attribute space into feasible subsets. Each partition 
element is associated with a set of equally important decisions. In the second 
phase we consider only those partitions which are associated with the minimum 
value of the cost criterion. In this phase the partition elements are joined into 
the bigger feasible subsets, provided that the sets of equally important decisions 
have a non-empty intersection. During the above procedure the number of rules 
of the resultant classifier is minimized. 
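
To make the idea of a single division step more concrete, the following Python sketch chooses one cut point of an ordered attribute so that the average misclassification cost is minimal. This is a deliberately simplified illustration and not the ProbRough implementation: it ignores the beam search, the prior probabilities and the second (rule reduction) phase, and the function names, the cost table and the toy data are all assumptions.

```python
def segment_cost(labels, costs):
    """Smallest total cost of assigning one decision to every case in a segment.

    labels: true classes of the cases in the segment
    costs: dict mapping (true_class, decision) -> cost
    """
    decisions = {decision for (_, decision) in costs}
    return min(sum(costs[(y, d)] for y in labels) for d in decisions)


def best_split(values, labels, cut_points, costs):
    """Return (cut, average_cost) for the cheapest division into two segments."""
    best = None
    for cut in cut_points:
        left = [y for v, y in zip(values, labels) if v <= cut]
        right = [y for v, y in zip(values, labels) if v > cut]
        if not left or not right:
            continue  # a division must produce two non-empty segments
        avg = (segment_cost(left, costs) + segment_cost(right, costs)) / len(labels)
        if best is None or avg < best[1]:
            best = (cut, avg)
    return best


# Toy data (assumed): one ordered attribute, binary class, equal costs.
values = [0, 0, 1, 2, 3]
labels = [0, 0, 1, 1, 1]
costs = {(0, 0): 0, (1, 1): 0, (0, 1): 1, (1, 0): 1}
print(best_split(values, labels, [0, 1, 2], costs))
```

Replacing the equal costs by unequal ones changes which decision is cheapest in each segment, which is how the cost criterion steers the segmentation in this simplified picture.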

We refer to Van den Poel (1998) for a comparison study between several 
different data mining methods for response modeling. 



Dataset 

A random sample of 6800 observations was taken from the database of an anonymous
European mail-order company in such a way that 50% of the customers in the
sample responded to the offers during a 6-month period and 50% did not. This
sample was randomly split to obtain a learning sample and a test sample. All models were
built on the basis of the learning sample. The predictive performance was judged
on the basis of the test sample, i.e. observations which the system had not used 
during the learning phase. Among the many attributes that could be constructed 
based on past transaction data, three variables have been identified by Cullinan
(Petrison et al., 1993) to be of particular importance in database marketing
modeling: recency, frequency and monetary value. Several instances of these 
RFM-variables were included in this study as shown in Table 1. All variables 
were already categorized and were provided by the mail-order company at the 
level of the individual customer. 



Table 1. Description of the attributes in the database. 



Name        Type     Description
Buy_{t-1}   0/1      Did the customer buy during the previous 6 months?
Buy_{t-2}   0/1      Did the customer buy in the period: 1 year ago - 6 months ago?
Buy_{t-3}   0/1      Did the customer buy during the period: 1.5 years - 1 year ago?
Buy_{t-4}   0/1      Did the customer buy during the period: 2 years - 1.5 years ago?
Customer    6 cat.   For how long is this person a customer?
LastFreq    9 cat.   What was the purchasing frequency during the last 6 months?
LastSales   5 cat.   Sales generated by the customer during the last 6 months
LastProfit  10 cat.  Profit generated by the customer during the last 6 months
DaysSince   7 cat.   No. of days since the last purchase
Unimulti    3 cat.   Does the customer live in a stand-alone home or an apartment?
Socclass    6 cat.   Social class of the customer
VAT         0/1      Is the person self-employed?
Household   4 cat.   Type of household the customer belongs to
Family      4 cat.   Number of families living at the address
Natclass    6 cat.   Nationality distribution in the street of the customer
State       9 cat.   Province of Belgium the customer lives in



Results 

This section discusses the results which were obtained from applying the Prob- 
Rough system to the problem at hand. We present the resulting rough classifiers 
obtained with the standard priors and costs of misclassification, together with 
two alternative results with non-standard priors and costs of misclassification, 
which were typical for the mail-order company. 



Standard priors and costs of misclassification. Table 2 contains three 
equivalent sets of decision rules, created by the ProbRough system for the case 
of equal prior probabilities of purchase and no purchase, and equal costs of 
misclassification. A decision for every (new) customer could be taken based on 
each of the three rulesets. The classification of cases to either the Do mail or Do
not mail category was then complemented by the strengths of each of the rules.
The latter criterion was used to assign an order among the segments.



Table 2. Results with standard priors and costs of misclassification 



Ruleset 1
1. if LastFreq > 0 then d = Do mail,
2. if LastFreq = 0 and Buy_{t-2} = Yes then d = Do mail,
3. if LastFreq = 0 and Buy_{t-2} = No and DaysSince < 180 then d = Do mail,
4. if LastFreq = 0 and Buy_{t-2} = No and DaysSince > 180 then d = Do not mail.

Ruleset 2
1. if Buy_{t-1} = Yes then d = Do mail,
2. if Buy_{t-1} = No and Buy_{t-2} = Yes then d = Do mail,
3. if Buy_{t-1} = No and Buy_{t-2} = No and DaysSince < 180 then d = Do mail,
4. if Buy_{t-1} = No and Buy_{t-2} = No and DaysSince > 180 then d = Do not mail.

Ruleset 3
1. if LastProfit > 0 then d = Do mail,
2. if LastProfit = 0 and Buy_{t-2} = Yes then d = Do mail,
3. if LastProfit = 0 and Buy_{t-2} = No and DaysSince < 180 then d = Do mail,
4. if LastProfit = 0 and Buy_{t-2} = No and DaysSince > 180 then d = Do not mail.



We observed that the structure of the three sets of decision rules was very
similar even though different attributes were used. These predictors were close
substitutes because the correlations between LastFreq and Buy_{t-1} (0.80) and
between LastFreq and LastProfit (0.91) were high. This caused the results in Table 2 to
be very close to each other. The decision rules which we obtained closely reflected
the database marketing theory that recency (attribute DaysSince), frequency (LastFreq,
Buy_{t-1}, Buy_{t-2}) and monetary value (LastProfit) are the best predictors
of future purchasing behavior, even though these measures are known to be
highly intercorrelated. The findings also show that only very recent transactional
data (up to one year before the period of consideration) were useful for
predictive purposes.
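
The transparency of such rules can be illustrated by writing the first two rulesets of Table 2 as plain Python functions. This is only a sketch: the function and argument names are hypothetical, DaysSince is treated as a number of days although the attribute is categorical in the data, and the boundary case of exactly 180 days, which the extracted table leaves open, is assigned to Do not mail as an assumption.

```python
def ruleset_1(last_freq, buy_t2, days_since):
    """Ruleset based on LastFreq; buy_t2 is True when Buy_{t-2} = Yes."""
    if last_freq > 0:
        return "Do mail"
    if buy_t2:
        return "Do mail"
    # Assumption: exactly 180 days falls into the 'Do not mail' segment.
    if days_since < 180:
        return "Do mail"
    return "Do not mail"


def ruleset_2(buy_t1, buy_t2, days_since):
    """Ruleset based on Buy_{t-1} instead of LastFreq."""
    if buy_t1 or buy_t2:
        return "Do mail"
    return "Do mail" if days_since < 180 else "Do not mail"


# A customer without any purchase in the last year and a long-ago last order.
print(ruleset_1(last_freq=0, buy_t2=False, days_since=400))
print(ruleset_2(buy_t1=False, buy_t2=False, days_since=400))
```

Both functions have the same shape; only the first test differs, which mirrors the observation that the correlated attributes act as substitutes.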



Table 3. Classification results (standard priors and costs of misclassification). 





                                                    Ruleset 1  Ruleset 2  Ruleset 3
Prob(Decision=Do mail if Reality=no purchase)          0.25       0.22       0.25
Prob(Decision=Do not mail if Reality=no purchase)      0.75       0.78       0.75
Prob(Decision=Do mail if Reality=purchase)             0.74       0.72       0.74
Prob(Decision=Do not mail if Reality=purchase)         0.26       0.28       0.26

Table 3 contains the results of the use of the rough classifiers shown in Table 2
on the test sample (i.e., cases which had not been submitted to the data mining
method before). The classification results exhibited a favorable picture (about
three quarters of the cases were correctly classified and neither of the two decisions was
favored) and revealed that all three rulesets performed very similarly, which
showed that they were truly substitutable on the test sample as well. These
results were similar to the findings by Van den Poel (1998).
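
The statement that about three quarters of the cases were classified correctly can be checked from the conditional probabilities in Table 3. The short sketch below weights them with equal priors, as in the balanced sample; the helper name `overall_accuracy` is hypothetical.

```python
# Conditional probabilities read from Table 3, per ruleset:
# (P(Do not mail | no purchase), P(Do mail | purchase)).
table3 = {
    "Ruleset 1": (0.75, 0.74),
    "Ruleset 2": (0.78, 0.72),
    "Ruleset 3": (0.75, 0.74),
}


def overall_accuracy(correct_no_purchase, correct_purchase, prior_purchase=0.5):
    """Probability of a correct decision for a given prior probability of purchase."""
    return (prior_purchase * correct_purchase
            + (1 - prior_purchase) * correct_no_purchase)


accuracies = {name: overall_accuracy(*rates) for name, rates in table3.items()}
for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
```

All three values lie within half a percentage point of 0.75, in line with the observation that the rulesets are interchangeable on the test sample.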



Unequal priors and costs of misclassification 2:1. In database market- 
ing applications, misclassification costs are typically not equal. Usually, costs of 
classifying a buyer as a non-buyer are much higher than classifying a non-buyer 
as a buyer, because the cost in the latter case is often limited to the cost re- 
lated to the mailing piece, whereas in the former case foregone profits are much 
more important. Therefore, we reran the ProbRough algorithm with a ratio of 
misclassification costs 2:1. Moreover, we adjusted the prior probabilities to 0.4 
for the probability of a purchase and hence, 1 - 0.4 or 0.6 for the probability of 
no purchase. The search process of ProbRough is directed now by the average 
cost of misclassification that includes unequal priors and costs of misclassifica- 
tion. The problems of unequal priors and unequal costs of misclassification are 
related (see, Piasta and Lenarcik, 1998). The ratio of misclassification costs 2:1 
is partly compensated by the assumed prior probabilities. 
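
The interplay of priors and costs can be illustrated with the textbook minimum expected cost rule. The sketch below is not the internal computation of ProbRough; it only shows, under the stated assumptions (a balanced sample, a population prior of 0.4 for purchase, and a 2:1 cost ratio), how the two adjustments act in opposite directions. The function names are hypothetical.

```python
def adjusted_posterior(p_buy_sample, prior_sample=0.5, prior_true=0.4):
    """Re-weight a purchase probability estimated on the sample to the assumed prior."""
    w_buy = prior_true / prior_sample
    w_no = (1 - prior_true) / (1 - prior_sample)
    numerator = p_buy_sample * w_buy
    return numerator / (numerator + (1 - p_buy_sample) * w_no)


def decide(p_buy, cost_missed_sale=2.0, cost_wasted_mail=1.0):
    """Choose the decision with the lower expected misclassification cost."""
    expected_cost_mail = (1 - p_buy) * cost_wasted_mail
    expected_cost_skip = p_buy * cost_missed_sale
    return "Do mail" if expected_cost_mail <= expected_cost_skip else "Do not mail"


# A segment in which half of the sampled customers bought.
p = adjusted_posterior(0.5)
print(round(p, 3), decide(p))
```

With equal priors and equal costs the mailing threshold on the sample probability would be 0.5. The 2:1 cost ratio alone lowers it to 1/3, while the prior of 0.4 pushes it back up, so that the combined threshold is 1.2/2.8, roughly 0.43. This is one way to read the remark that the cost ratio is partly compensated by the assumed priors.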

When we compared the generated rough classifiers to those in Table 2, the 
only difference was in the value for the attribute of the number of days since 
the last purchase. We could therefore conclude that the change in the costs of 
misclassification led to an increase in the period of consideration (value of 180 
days in Table 2 versus a value of 365 days now). 



Unequal priors and costs of misclassification 3:1. Now we present results 
obtained when we increased the ratio of misclassification costs from 2:1 to 3:1. 
The purpose was to investigate the impact on both the types of attributes used 
and the classification results. The use of the induced rough classifier (shown in
Table 4) on the test sample exhibited a decrease in the probability of wrongfully
assigning a buyer to the non-buyer category (as compared to Table 3) and an 
increase in the probability of incorrectly assigning a non-buyer to the buyer 
category. This is a consequence of the fact that the search process in the space 
of models is guided by the criterion which depends on the assumed ratio of 
misclassification costs. 

From a marketing perspective, it was remarkable that a geographic element 
became an important segmentation variable in the prediction of specific con- 
sumer behavior. However, this finding was confirmed by the mail-order company 
managers, who pointed to substantial differences in consumer behavior between 
different regions of Belgium. Moreover, the number of families living at a certain 
address became important in the decision whether or not to mail. Nevertheless, 
we have to mention that the strength of rules containing these additional non-RFM
variables is not all that large. With respect to the period of consideration
of the RFM variables, the same duration of one year was discovered to be most
important as for the ratio of misclassification costs 2:1.

Table 4. Results with unequal prior probabilities and costs of misclassification 3:1

 1. if LastFreq > 0 & State ∉ {4,5} then d = Do mail,
 2. if LastFreq > 0 & State ∈ {4,5} & Family > 2 then d = Do not mail,
 3. if LastFreq > 0 & State ∈ {4,5} & Family < 2 then d = Do mail,
 4. if LastFreq = 0 & State ∉ {4,5} & Buy_{t-2} = Yes then d = Do mail,
 5. if LastFreq = 0 & State ∉ {4,5} & Buy_{t-2} = No & DaysSince < 365 then d = Do mail,
 6. if LastFreq = 0 & Buy_{t-2} = No & DaysSince > 365 then d = Do not mail,
 7. if LastFreq = 0 & State ∈ {4,5} & Buy_{t-2} = Yes & Family < 2 then d = Do mail,
 8. if LastFreq = 0 & State ∈ {4,5} & Buy_{t-2} = Yes & Family > 2 then d = Do not mail,
 9. if LastFreq = 0 & State ∈ {4,5} & Buy_{t-2} = No & Family > 2 & DaysSince < 365 then d = Do not mail,
10. if LastFreq = 0 & State ∈ {4,5} & Buy_{t-2} = No & Family < 2 & DaysSince < 365 then d = Do mail.



Conclusions and future research 



The results obtained from the application of the ProbRough algorithm redis- 
covered the RFM variables (known from theory) as most significant predictors 
of future mail-order buying behavior. However, the data mining method high- 
lighted the fact that only rather recent transactional data (up to one year before 
the period of consideration) was useful in the prediction of future purchase behavior.
A change in the costs of misclassification was reflected in an increase
of the period of consideration (to a one-year period). A significant finding was
the importance of non-RFM variables in predicting purchasing behavior as the 
ratio of misclassification costs became larger. This corresponded to prior beliefs 
of marketing managers but has not yet been revealed by other data mining ef- 
forts. Moreover, the application of a data mining method which was capable of 
handling misclassification costs clearly showed a decrease in the overall cost due 
to a lower percentage in foregone profits. We used predictors which had already 
been categorized. This is not a necessity for the ProbRough algorithm, since 
it features a built-in discretization process. However, we could not investigate 
this feature because only categorical data was available. Therefore, we leave this 
issue as a topic for future research. 

Abstract. The paper presents an attempt to apply the Rough Sets Theory
to Optical Character Recognition. In this approach specific characters’
features are referred to as an information system, from which the
most important information is being extracted by the Rough Sets Theory. 
This process is fully automatic and does not require any human decision 
in the area of usefulness of certain characters’ features. A discernibility 
matrix which is built in this way constitutes a reduced database for clas- 
sification algorithms. A brief description of Classical Optical Character 
Recognition Theory and Rough Sets Theory as well as some selected 
research and experimental results are also presented. 



1 Introduction 

Automatic identification of characters is getting more and more important in 
our modern civilization (e.g. recognition of addresses, zip codes, signatures). A 
recognition algorithm must be based on previously acquired knowledge
about the objects it is about to identify. The more objects there are and the more
complex they are, the bigger the knowledge base of the algorithm is. Therefore
it seems important to find an efficient way of data reduction, providing the same 
(or lower but acceptable) quality of identification. The solution to this problem 
may be the Rough Sets described in [1]. 

2 Classical Optical Character Recognition Theory 

Optical Character Recognition (OCR) is a field of science dealing with pattern 
analysis, especially with identification of characters. It is a very complex task. 
This research is focused on the last stage of character recognition in which sep- 
arated characters are treated as objects to be identified. This stage consists of 
pre-processing, feature extraction and classification. 

The pre-processing prepares a character in such a way that a representative
feature vector may be extracted. In the classification stage a classifier (e.g.
classical, statistical, fuzzy logic or neural network), employing certain knowledge
and based on the extracted feature vector, assigns the character to a particular
class. The knowledge base of the classifier comprises previously generated feature
vectors of characters from the learning set, whose membership classes are known.

W. Czajewski. In: L. Polkowski and A. Skowron (Eds.): RSCTC’98, LNAI 1424, pp. 601-604, 1998.
© Springer-Verlag Berlin Heidelberg 1998

In order to recognize a character correctly, the vector closest to the character’s
vector must be found among the learning set vectors (the minimal distance classifier,
used mostly in this research). Although this method is relatively simple and
fast, it may cause mistakes, and thus the so-called k-nearest neighbor classifier
is often used. It improves the recognition results by a few percentage points, but the
general tendency stays the same.
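
Both classifiers mentioned above can be sketched in a few lines of Python. The squared Euclidean distance is an assumption, since the paper does not name the metric, and the function names and the tiny learning set are purely illustrative. With k = 1 the function acts as the minimal distance classifier.

```python
from collections import Counter


def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors (assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def knn_classify(x, learning_set, k=1):
    """Classify x by majority vote among its k nearest learning vectors.

    learning_set: list of (feature_vector, class_label) pairs
    """
    nearest = sorted(learning_set, key=lambda item: squared_distance(x, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    # Ties are resolved in favor of the label encountered first, i.e. the closer one.
    return votes.most_common(1)[0][0]


# Toy learning set with two-element feature vectors (assumed data).
learning_set = [((0, 0), "0"), ((0, 1), "0"), ((5, 5), "1"), ((6, 5), "1")]
print(knn_classify((1, 0), learning_set))        # minimal distance classifier
print(knn_classify((5, 4), learning_set, k=3))   # 3-nearest neighbor vote
```

The cost of each classification grows with both the number of learning vectors and the vector length, which is why shortening the feature vector matters for recognition time.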

3 Elements of the Rough Sets Theory 

Due to the increasing size of databases and the huge amount of information stored in
them, it became necessary to develop efficient methods of acquiring the
most essential and useful knowledge from databases. This problem was among
the most important ones in the field of modern information systems. In 1982 Z.
Pawlak proposed in [1] a new theory of reasoning about data called the Rough
Sets Theory.

3.1 Decision Tables 

In this approach knowledge is represented by a set of data organized in a table 
called an information system. Rows of the table refer to objects (e.g. characters), 
and columns to their attributes (features). Reduction of knowledge means delet- 
ing unnecessary and superfluous attributes and leaving only the most important 
ones. 

Let DT be a consistent decision table (one without objects with identical
condition attributes and different decision attributes¹), $C$ be a set of condition
attributes, $D$ be a set of decision attributes and $a \in C$. An attribute $a$ is dispensable
in the decision table DT if the decision table DT is also consistent without
the attribute $a$; otherwise the attribute $a$ is indispensable in the decision table
DT. A decision table DT is called independent if all the attributes $a \in C$ are
indispensable in the decision table DT. A subset of attributes $R \subseteq C$ is called
a reduct of $C$ in the decision table DT if the new decision table $(U, R \cup D, V, f)$
is consistent and independent.
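
The definitions above translate directly into a brute-force check, sketched here in Python. The representation of a decision table as a list of (condition values, decision) pairs, the function names and the three-row example are assumptions made for illustration.

```python
def is_consistent(rows, attrs):
    """True if no two rows agree on all attrs but differ in their decision.

    rows: list of (conditions, decision), conditions being a dict attribute -> value
    """
    seen = {}
    for conditions, decision in rows:
        key = tuple(conditions[a] for a in attrs)
        if seen.setdefault(key, decision) != decision:
            return False
    return True


def is_reduct(rows, attrs):
    """True if attrs keeps the table consistent and every attribute is indispensable."""
    attrs = list(attrs)
    if not is_consistent(rows, attrs):
        return False
    return all(
        not is_consistent(rows, [b for b in attrs if b != a]) for a in attrs
    )


# Toy decision table (assumed): attributes a, b, c and decisions x / y.
rows = [
    ({"a": 0, "b": 0, "c": 1}, "x"),
    ({"a": 0, "b": 1, "c": 1}, "y"),
    ({"a": 1, "b": 1, "c": 0}, "y"),
]
print(is_reduct(rows, ["b"]))        # b alone separates the classes
print(is_consistent(rows, ["a", "b"]), is_reduct(rows, ["a", "b"]))
```

In the example the subset {a, b} is consistent but not independent, so it is a dependent subset in the sense used in the next paragraph rather than a reduct.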

Any subset satisfying the above except for being independent also defines 
the knowledge but is superfluous. These subsets (later referred to as dependent 
subsets) are of great importance in the conducted research. 

3.2 Discernibility Matrix 

Verifying whether a given subset is a reduct or a dependent subset
involves a huge number of comparisons and is therefore very time-consuming.

The notions of discernibility matrix and discernibility function introduced by Rauszer and
Skowron in [2] make the search time much shorter. The discernibility matrix
stores only the information whether the corresponding attributes of any pair of objects
are different or not. This information can be stored in single bits grouped
into bytes. Hence, there is much less data to be compared than previously and the
processing time is shorter.
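
A bit-level representation of this kind can be sketched in Python with integers used as bit masks. The sketch keeps one entry per pair of objects from different classes, which is the usual decision-relative variant and an assumption here, since the paper speaks of any pair of objects. All names and data are illustrative.

```python
def discernibility_bits(x, y):
    """Bit i is set when the feature vectors x and y differ on attribute i."""
    mask = 0
    for i, (a, b) in enumerate(zip(x, y)):
        if a != b:
            mask |= 1 << i
    return mask


def build_matrix(objects, labels):
    """One bit mask per pair of objects that belong to different classes."""
    n = len(objects)
    return [
        discernibility_bits(objects[i], objects[j])
        for i in range(n)
        for j in range(i + 1, n)
        if labels[i] != labels[j]
    ]


def subset_is_consistent(matrix, subset_mask):
    """A subset keeps the table consistent if it discerns every stored pair."""
    return all(entry & subset_mask for entry in matrix)


objects = [(0, 0, 1), (0, 1, 1), (1, 1, 0)]
labels = ["x", "y", "y"]
matrix = build_matrix(objects, labels)
print(matrix)
print(subset_is_consistent(matrix, 0b010), subset_is_consistent(matrix, 0b001))
```

Once the matrix is built, testing a candidate subset needs only one bitwise AND per stored pair instead of a comparison of the original attribute values.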

¹ In other words: objects with identical attributes belonging to different classes.

4 Research and Conclusions

4.1 Reducts 

All the tests were performed on one set of objects: printed digits (0-9) based
on MS Windows TrueType fonts. Each digit was stored as a monochrome
48x48 pixel bitmap. In total 2330 characters were tested, 233 of each
kind. Of these, 155 belonged to the learning set and the other 78 to the testing set.

At first a few types of feature vectors were tested and the best one was established.
It comprised 64 integer features. The information system built from it was
then reduced repeatedly. Initially one 8-element reduct was found. The length
of this reduct was only 12,5% of the feature vector’s length. According to the reduct’s
definition it allowed perfect recognition of all the objects from the learning set.
The testing set, however, was recognized with much lower quality, because the
reduct was optimally chosen for the known objects from the learning set and
no others. After further examinations the lowest (59,7%), the highest (72,4%)
and the average (66,6%) recognition quality for 8-element reducts were found.
These results clearly indicate that the reduct is unsuitable for the recognition of
characters from outside the learning set. The classification quality in this case
is much lower (ca. 30 percentage points on average) than when all the features
are used. The only method of increasing the recognition quality was the use
of other (longer) dependent subsets.



4.2 Dependent Subsets 

Because of the enormous number of all possible subsets of the whole attribute set
of the information system ($2^{64} \approx 1.8 \cdot 10^{19}$), verifying every subset is virtually
impossible. Therefore a random search algorithm was used. After testing thousands
of subsets, some conclusions of a statistical nature were reached. As one can
notice (Table 1), some of the dependent subsets give better recognition results
than the full feature vector!

Table 1. Recognition results of the testing set for dependent subsets of a given length 



Length of subset   13    19    26    32    38    45    51    58    64
Minimal quality    70,5  71,5  79,5  83,6  86,8  87,4  89,7  91,3  93,3
Maximal quality    70,5  71,5  79,5  83,6  86,8  87,4  89,7  91,3  93,3
Average quality    72,5  80,1  85,3  87,8  89,9  91,1  92,2  92,8  93,3



They should be regarded as sets of features with the least important
features eliminated. Unfortunately there is no analytical method of finding them.
In practical applications one must rely only on statistical results (which are much less
optimistic).

As a matter of fact, over 27% of 58-element dependent subsets give better
recognition results than the full feature vector, but this number decreases quickly
as the vector’s length drops (it amounts to 5% for 51-element subsets and less for
the remaining ones).
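
A random search of the kind used in this section can be sketched as follows. The sketch is an assumption about the general scheme, not the author's program: it draws random attribute subsets of a given length and keeps those that still discern every pair of objects from different classes, where each pair is represented by a bit mask of the attributes on which the two objects differ. The function name, the seed and the toy masks are hypothetical.

```python
import random


def random_consistent_subsets(matrix, n_features, length, trials, seed=0):
    """Sample random subsets of `length` attributes and keep the consistent ones.

    matrix: list of bit masks, one per pair of objects from different classes;
            bit i is set when the pair differs on attribute i
    """
    rng = random.Random(seed)
    found = []
    for _ in range(trials):
        subset = rng.sample(range(n_features), length)
        mask = sum(1 << i for i in subset)
        if all(entry & mask for entry in matrix):
            found.append(sorted(subset))
    return found


# The full search space for 64 features cannot be enumerated.
print(2 ** 64)

# Toy masks (assumed): the first pair differs only on attribute 1.
matrix = [0b010, 0b111]
print(random_consistent_subsets(matrix, n_features=3, length=2, trials=5))
```

Each subset that survives this filter would then be evaluated on the testing set, and the minimal, maximal and average recognition quality per length collected as in Table 1.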

The shortest dependent subset found that gave recognition results not worse
than the full feature vector had 38 elements (ca. 60% of the full vector’s length).
A graphical interpretation of this subset is shown in Fig. 1a. For comparison,
Fig. 1b shows the worst dependent subset found of the same length. The two
subsets differ in 14 elements.





Fig. 1. The distribution of features for a) the best and b) the worst 38-element reduct 
(white squares represent the features that were neglected in the process of recognition) 



4.3 Conclusions 

On the basis of the conducted research it can be stated that:

— it is enough to use 5-15% of the knowledge from an information system in 
order to recognize all the characters from this system perfectly. In other 
words, up to 95% of data in an information system is dispensable. 

— it is possible to find a reduct that will give the same recognition results for 
characters from outside the information system as the full length vector. In 
this case ca. 40-50% of data in an information system is dispensable. 

Generally one can notice that the more features a single character has, the shorter
(in percent, not absolutely) the minimal reduct is. Also, less (in percent, not
absolutely) data is required to provide the same recognition quality for characters
outside the information system. The obtained results are quite interesting. They
mean a substantial (even 10-fold in the first case) reduction of recognition time
and hardware requirements, which leads directly to economic savings.

Abstract. This paper briefly describes the ROSE software package.
It is an interactive, modular system designed for analysis
and knowledge discovery based on rough set theory in 32-bit 
operating systems on PC computers. It implements classical 
rough set theory as well as its extension based on variable pre- 
cision model. It includes generation of decision rules for classi- 
fication systems and knowledge discovery. 



1 Introduction 

ROSE (Rough Set Data Explorer) is a modular software system implementing 
basic elements of the rough set theory and rule discovery techniques. It has 
been created at the Laboratory of Intelligent Decision Support Systems of the
Institute of Computing Science in Poznan, based on fourteen years of experience
in rough set based knowledge discovery and decision analysis.

All computations are based on rough set fundamentals introduced by Z. 
Pawlak [6]. One of the implemented extensions applies the variable precision rough
set model defined by W. Ziarko [14]. It is particularly useful in analysis of data 
sets with large boundary regions. 

The ROSE system is a successor of RoughDAS and RoughClass systems 
[3] [5] [10]. RoughDAS is historically one of the first successful implementations 
of the rough set theory, which has been used in many real life applications. 
Due to limitations of RoughDAS, especially its incapability to make full use of 
currently available computers, there was a need to design and implement new 
software. 

ROSE started as several independent modules that were later put together
in one system. First we were motivated to create a computational engine working
on more powerful computers (e.g. UNIX workstations), allowing faster analysis
of large data sets. Then we came to the point of creating a user-friendly interface,
where Microsoft Windows was chosen as our basic platform. So the modules
can be separately redesigned and recompiled without much interference from 
user’s point of view. The only component that is strictly platform dependent is 
graphical user interface (GUI). All this guarantees that the system can be easily 
adapted for future operating systems and platforms. 

2 ROSE system



The program ROSE is an interactive software system running under 32-bit GUI 
operating systems (Windows 95/NT 4.0) on PC compatible machines. The core 
modules were written in C++ programming language (standard ANSI), while 
the interface modules were developed using Borland C++ (with Object Windows 
libraries) and Borland Delphi. 

The system consists of a graphical user interface (GUI) and a set of separate
computational modules. The modules are platform independent and can be
recompiled for different targets, including UNIX machines. The GUI acts as an overlay
on all computational modules, so it is quite easy to add new modules to the
ROSE system. This is an important characteristic, because it makes the system
easier to extend in the future.

ROSE is designed to be an easy-to-use, point-and-click, menu-driven,
user-friendly tool for data exploration and analysis. It is meant both for experts
and for occasional users who want to analyze data. The system
communicates with users through dialog windows, and all results are presented within this
environment. Data can be edited using a spreadsheet-like interface.



3 Input/output data 

ROSE accepts input data in the form of a table called an information table, in which
rows correspond to objects (cases, observations, etc.) and columns correspond
to attributes (features, characteristics). The attributes are divided into disjoint
sets of condition attributes (e.g. results of particular tests or experiments) and
decision attributes (expressing the partition of objects into classes, i.e. their
classification). The data is stored in a plain text file according to a defined
syntax (Information System File, ISF). ROSE can also import data stored by
its predecessor RoughDAS and export to several other formats (including files
accepted by the LERS and C4.5 systems).

The ISF file specification allows for long attribute names (up to 30 alphanumeric
characters) and string values of attributes (such as 'high', 'low') in addition to real
and integer values. Because it is a plain text file, it can be transferred between
different operating systems without modification. It is also easy to edit the file and
to verify the correctness of the data it contains.

The file format has an open structure. It is divided into sections, and
new, so far undefined sections can be added for future use. The user
can decide to ignore some of the attributes simply by changing the qualification of
those attributes.

Besides being visualized in the GUI, all results are also written to plain text files,
so they are readable outside the system and can easily be converted to other file
formats.

4 Features

Features currently offered by computational modules include: 

— data validation and preprocessing, 

— automatic discretization of continuous-valued attributes according to the
Fayyad & Irani method [1], as well as user-driven discretization,

— qualitative estimation of the ability of the condition attributes to
approximate the objects' classification, using either the standard rough set model or
the variable precision model extension,

— finding the core of attributes as well as looking for reducts in the information
table (either all reducts or a population of reducts of predetermined size)
using several methods, such as the algorithm by S. Romanski [7] and a modified
algorithm by A. Skowron [8],

— examining the relative significance of a given attribute for the classification 
of objects, by observing the changes in the quality of classification, 

— reducing superfluous attributes and selecting the most significant attributes
for the classification of objects; several techniques are available that
support the choice of subsets of attributes ensuring a satisfactory quality
of the classification (e.g. the technique of adding the most discriminatory
attributes to the core),

— inducing decision rules using either the LEM2 algorithm [2] or the Explore 
algorithm [4] [13], 

— postprocessing of induced rules, e.g. pruning, and looking for interesting rules
according to user-defined queries [4],

— applying the decision rules to classify new objects using different techniques
of rule matching, in particular an original approach based on a valued closeness
relation [3] [9],

— evaluation of the sets of decision rules using k-fold cross-validation
techniques.

Thanks to its open architecture, new modules can easily be added to the
system.

5 Final Remarks 

In the near future we plan to add new capabilities to our system, such as: incre- 
mental reduct generation, incremental rule generation, working with incomplete 
information tables, working with similarity relations for rough approximations, 
working with dominance relations for rough approximation of multicriteria clas- 
sification problems, working with dominance relations and pairwise comparison 
tables for rough approximation of multicriteria choice and ranking problems. 
These functionalities are based on recent research results of the team members. 

The ROSE system and its predecessor RoughDAS have been applied to many
real-life data sets. References to these applications are given, e.g., in [5].
The main fields of application include medicine, pharmacy, technical
diagnostics, finance and management science, image and signal analysis, geology,
and software project evaluation.
Abstract. The paper describes an application of the soft computing
methods of rough sets and Bayesian inference to breast cancer detection
using electropotentials. Statistical principal component analysis (PCA)
and the rough sets method were applied for feature extraction,
reduction and selection. The quadratic discriminant was applied as a classifier
for breast cancer detection.



1 Introduction 

An investigation of a breast model indicates variations in epithelial
electropotentials that may occur in the area of abnormally proliferating cells in the vicinity
of a neoplasm. From the observation that altered skin surface electropotentials may
be caused by the presence of underlying abnormal proliferation, an idea of cancer
detection by measurement and recognition has been formed (Crowe and Faupel
1996; Long et al. 1996; Davies 1996; Dixon 1996).

This technique is based on the processing and recognition of electropotentials 
(EPs) measured by an array of sensors in the suspicious regions of a breast 
(Crowe and Faupel 1996; Long et al. 1996). 

The cancerous tissue influences the metabolic, chemical, ionic and thus
electromagnetic processes of a breast (Crowe and Faupel 1996; Long et al. 1996; Davies
1996; Dixon 1996).

These complex processes in a healthy and a cancerous breast can be modeled based
on classical findings in system and model theory. The discovery of
differences in EP readings between a healthy and a cancerous breast offers a
chance to design a device for breast cancer detection using noninvasive EP
measurements (Crowe and Faupel 1996; Long et al. 1996; Davies 1996; Dixon
1996).

It is generally known how cells develop and stabilize (maintain) an ionic gradient
(difference), which results in an electrical potential gradient across a cell membrane
(Davies, 1994). In a breast the epithelial cells are arranged to line ducts or
lobules. They have specialized apical and basolateral domains in order to
provide basal absorption, secretion and milk production during the lactation phase. The
membranes of the considered cells have different permeabilities and transport
(transition) functions. They have different electrical potentials with respect to each other,
which results in a transepithelial electrical potential.

It has been observed that proliferation is increased in cancer cells and
becomes disregulated in the surrounding breast terminal ductal
lobular units (Davies, 1994). Field (subarea) cancerization results in field
depolarization. This eventually extends into a penumbra measurable on the skin surface,
for example as an electropotential. Different proliferation in a healthy and a cancerous
breast causes different depolarizations and thus a difference in
electropotentials on the skin surface. This means that the altered transepithelial electrical
potentials can be measured indirectly with skin electrodes. This leads to the use
of comparative measurements of electropotentials on the skin surface of a healthy
and a cancerous breast as a feature for noninvasive cancer diagnosis.

We discuss an application of the soft computing methods of rough sets, principal
component analysis and Bayesian inference to data mining and breast cancer
detection using electropotentials.



2 Data 

Measurement and processing of breast electropotentials 

Electropotential (EP) measurements (signals)

For each patient the measurement of electrical potentials (EPs) on breast skin spots
was carried out by the designed system (Long et al., 1996; Dixon, 1996; Davies,
1994). An array of 8 sensors was glued to each breast, with one reference sensor
located on the left and one on the right palm. The (2 x (2 x 8)) electropotential signals from
N_sensors = 16 sensors located on the breasts were simultaneously measured and
recorded. The total number of measurements at one time moment was 32 (for the L and R
reference signals). There were two sets of measurements collected simultaneously:
relative to the left palm and relative to the right palm. Only signals referring to the
left palm were considered in the recognition task. The gathered data were as follows.
For each jth patient (j = 1, 2, ..., N), with one (left) palm reference, there were 16
time-series simultaneously measured and recorded, one for each ith sensor. Each
time-series of the jth patient for the ith sensor contained raw electropotentials
measured at N_ep = 150 time moments (with a sampling time Δt = 0.2 sec), forming a data set

$$
\mathrm{Patient}^{j,i} = \{x^{j,i}(1), x^{j,i}(2), \ldots, x^{j,i}(N_{ep})\}, \quad j = 1, 2, \ldots, N, \quad i = 1, 2, \ldots, N_{sensors} \tag{1}
$$

Measurements may contain disturbances even when the measuring device is properly
designed and calibrated, with insulated leads and electronics (but not insulated
sensors), and without internal noise or oscillations.

Sources of disturbances:

1. Inaccurate experimental procedure (sequence, conditions).

2. Noisy environment (disturbing external devices).

3. Patient movement (moving, talking, sneezing, coughing, etc.).

4. Loss of sensor contact with the skin due to rapid patient movement.

5. Deodorant used by the patient.



Preprocessing of raw EP data 



Due to measurement disturbances and the time-dependent dynamics of sensor
signals, the raw EP signals are preprocessed. The preprocessing includes filtering
and averaging combined with outlier removal. Each sensed spot of a breast is
initially represented by a time-series of 150 noisy raw EP signals. A two-stage
averaging process, combined with outlier removal, was used to find a
representative EP for a sensor. The final average of the time-series (after outlier removal) is the
representation of the EP at the measured skin spot.
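
The two-stage averaging can be illustrated with a minimal Python sketch. The text does not state the outlier criterion, so the rule used here (discard samples farther than k standard deviations from the first-stage mean) and the names `representative_ep` and `k` are assumptions made only for illustration.

```python
from statistics import mean, stdev

def representative_ep(series, k=2.0):
    """Two-stage average of one sensor's time series.

    Stage 1 computes the mean and standard deviation of all samples.
    Samples farther than k standard deviations from the stage-1 mean
    are treated as outliers (an assumed rule, not taken from the paper).
    Stage 2 averages the remaining samples.
    """
    if len(series) < 2:
        return mean(series)
    m1 = mean(series)
    s1 = stdev(series)
    if s1 == 0.0:
        return m1                      # constant signal, nothing to remove
    kept = [v for v in series if abs(v - m1) <= k * s1]
    return mean(kept) if kept else m1

# Hypothetical series: 149 quiet samples plus one spike (e.g. a patient move).
series = [10.0 + 0.1 * (i % 3) for i in range(149)] + [80.0]
print(round(representative_ep(series), 3))
```

With this rule the single spike is discarded and the result is the mean of the 149 quiet samples; other criteria (median-based, for example) would fit the description equally well.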



The resulting average was considered as the representative electropotential for
the spot related to the attached ith sensor of the jth patient. The collection of all
N_sensors averaged electropotentials for the jth patient forms an averaged EP
pattern

$$
\mathbf{x}^{j} = [x_{j,1}, x_{j,2}, \ldots, x_{j,N_{sensors}}]^{T} \tag{2}
$$



Scaling of the averaged EP data 

Some data sets may include data that should be subjected to scaling and
normalization. These operations are recommended before some processing stages
(for example, predictor design). They are needed to remove biases. For example,
the usual goal is to classify the same type of signals or data, and classification
must then be invariant to a shift of the attribute values.
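
As an illustration of per-attribute scaling, the sketch below standardizes every column to zero mean and unit variance. That "standard normalization" means exactly this z-score form is an assumption, and the function name `standardize_columns` is hypothetical.

```python
from statistics import mean, pstdev

def standardize_columns(patterns):
    """Scale every attribute (column) to zero mean and unit variance."""
    columns = list(zip(*patterns))
    means = [mean(c) for c in columns]
    # A constant column has zero deviation; divide by 1.0 to avoid 0/0.
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)]
            for row in patterns]

# Hypothetical 2-attribute patterns and a shifted copy of them.
raw = [[1.0, 100.0], [2.0, 110.0], [3.0, 120.0]]
shifted = [[v + 50.0 for v in row] for row in raw]
scaled = standardize_columns(raw)
```

Standardizing `shifted` gives the same values as standardizing `raw`, which is the shift invariance mentioned above. In a train/test setting the means and deviations would normally be estimated on the training set only.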



We have scaled each attribute x separately according to the standard
normalization formula.

Data sets and patterns



The averaged EP data set used for classification contained EP patterns with
only 14 elements (7 electropotentials for each breast, with an insufficient one
removed) and the corresponding class (benign: class 0, malignant: class 1). The
EP data contains, for each jth patient, a case (x^j, class^j) with a 14-element raw
EP pattern x^j and the associated class. The corresponding patients' IDs and ages
were stored in a separate file.



Features. Differentials 



The initially preprocessed EP data contains averaged EP patterns including
14 (averaged) electropotentials from both breasts. From these patterns, new
features have been formed, assuming that these features will be more relevant
for classification. Furthermore, some of them may have a close medical
interpretation used in medical diagnosis. The whole concept of cancer recognition is based
on the observation that electropotentials at specific spots on a cancerous breast will
be different (larger) than the corresponding electropotentials on the healthy breast.
Thus, differences between corresponding electropotentials can be considered as
major detection features. However, this may depend on the data type and
classifier design. The original EP data carries the same information as the
differentials. Creation of new features is a kind of transformation $\mathbf{y}^{j} = T(\mathbf{x}^{j})$ of
$N_{sensors}$-dimensional raw EP patterns $\mathbf{x}^{j} = [x_{j,1}, \ldots, x_{j,N_{sensors}}]^{T}$
into $N_{differentials} = n$-dimensional differential space patterns

$$
\mathbf{y}^{j} = [y_{j,1}, \ldots, y_{j,n}]^{T} \tag{3}
$$

Differentials are features created from the original averaged EP data set
containing 14-element EP patterns. Differentials are linear combinations of EP
features and other defined differentials. There are within-breast, between-breast
and mirror types of differentials. Examples of differentials include the
symptomatic breast differential, symptomatic breast high, etc. (Long et al., 1996;
Dixon, 1996; Davies, 1994). Some differentials have medical meanings, for
example a measure of local depolarization for the symptomatic breast (estimated
from differences between sensor EPs in the symptomatic breast).

The considered data set with differentials contains cases (y^j, class^j), where each
28-element pattern y^j contains an age code and 27 differential features.

The numerical code for age was defined as: age 1: age < 35, code 1; age 2:
35 < age < 65, code 2; age 3: age > 65, code 3.

Feature extraction, reduction and optimal selection by principal com- 
ponent analysis (PCA) 

Previous studies have mostly concerned the classification of 28 selected differentials
derived from the averaged EPs. Experiments and research show that decorrelating
and orthogonalizing transformations may provide a better representation of data
patterns, with stronger discernibility and classificatory power. Additionally, the
reduction of data while preserving classification accuracy, based on Occam's razor and
Rissanen's minimum description length principle, may lead to better generalization
of the designed classifier for unseen future patterns. We applied the Karhunen-Loeve
transformation (KLT), resulting from principal component analysis (PCA),
to the 28-element differential patterns (with an age code) for feature extraction
and reduction. This transformation allows us to discover the most expressive pattern
features and hidden (latent) variables.

An application of rough sets for feature reduction, classifiability and 
dependency discovery 

We used the method of rough sets (Pawlak, 1991) for selecting features from
patterns obtained as a result of the KLT. Rough sets allow us
to find so-called reducts in the data. Each reduct is a satisfactory set of
features allowing proper classification of the training set (without losing
classification accuracy). Since for a given data set rough sets may generate several
reducts, we provided a technique for selecting the best reduct, guaranteeing the best
generalization, based on statistical processing. The final classifier was designed
for the reduced data, with the best pattern features defined by the selected best reduct.
Additionally, rough sets allow us to find the most important pattern features, defined by the
so-called core, which is the set intersection of all reducts for given data.
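
The relation between reducts and the core can be illustrated on a toy decision table. The sketch below is a brute-force search written for illustration only; it assumes a consistent, already discretized table, and it is not the reduct algorithm used in the experiments. All names and data are hypothetical.

```python
from itertools import combinations

def is_consistent(rows, decisions, attrs):
    """True if objects that agree on attrs always share the same decision."""
    seen = {}
    for row, d in zip(rows, decisions):
        key = tuple(row[a] for a in attrs)
        if seen.setdefault(key, d) != d:
            return False
    return True

def reducts(rows, decisions):
    """All minimal attribute subsets that keep the table consistent."""
    n = len(rows[0])
    found = []
    for size in range(1, n + 1):
        for attrs in combinations(range(n), size):
            if any(set(r) <= set(attrs) for r in found):
                continue  # contains a smaller reduct, so it is not minimal
            if is_consistent(rows, decisions, attrs):
                found.append(attrs)
    return found

def core(all_reducts):
    """Core = set intersection of all reducts."""
    sets = [set(r) for r in all_reducts]
    return set.intersection(*sets) if sets else set()

# Toy table: attribute 2 duplicates attribute 1; decision = a0 XOR a1.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
decisions = [0, 1, 1, 0]
print(reducts(rows, decisions), core(reducts(rows, decisions)))
```

In this toy table attribute 0 belongs to every reduct and therefore forms the core, while attributes 1 and 2 are interchangeable. The exhaustive search is exponential in the number of attributes and only practical for small tables.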

Classification 

The goal of classification is to classify the patients, preliminarily recognized
by physical examination as having a cancer, into two distinct categories (classes):

— benign (category 0)

— malignant (category 1)

From the medical point of view it is especially important to obtain a maximal rate
of confidence in detecting malignant cancer. The designed classifier has to
be tested and its performance evaluated. This gives information about
the accuracy of recognition of new objects. We discuss some classifier
performance evaluation criteria in the next section.

Classification accuracy measures 

Assume that a test set $T_{test}$ (not used for training) consists of $N_{test}$ cases
(patients), with $N_{test,M}$ true malignant cases and $N_{test,B}$ true benign cases, with
$N_{test,M} + N_{test,B} = N_{test}$.

Assume that a classifier correctly recognized $N_{test,M,correc}$ malignant cases
and $N_{test,B,correc}$ benign cases. This means that
$N_{test,M,nocorrec} = N_{test,M} - N_{test,M,correc}$ malignant cases and
$N_{test,B,nocorrec} = N_{test,B} - N_{test,B,correc}$ benign cases were recognized
incorrectly. The total accuracy of classification can be measured by

$$
J_{class} = \frac{\text{number of all correctly classified cases}}{\text{number of all cases}} \times 100\%
= \frac{N_{test,M,correc} + N_{test,B,correc}}{N_{test}} \times 100\% \tag{4}
$$



To evaluate the malignant classification accuracy, the sensitivity (malignant)
measure is defined as

$$
J_{sens} = \frac{\text{number of correctly classified malignant cases}}{\text{number of true malignant cases}} \times 100\%
= \frac{N_{test,M,correc}}{N_{test,M}} \times 100\% \tag{5}
$$



To evaluate the benign classification accuracy, the specificity (benign) measure
is defined as

$$
J_{spec} = \frac{\text{number of correctly classified benign cases}}{\text{number of true benign cases}} \times 100\%
= \frac{N_{test,B,correc}}{N_{test,B}} \times 100\% \tag{6}
$$
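
The three measures defined in this section can be computed directly from the true and predicted classes. The following Python sketch is only an illustration; the function name and the toy labels are hypothetical, and it assumes that both classes occur in the test set.

```python
def classification_measures(true_classes, predicted_classes):
    """Return (J_class, J_sens, J_spec) in percent.

    Class 1 is malignant and class 0 is benign, as in the text.
    """
    pairs = list(zip(true_classes, predicted_classes))
    n_m = sum(1 for t, _ in pairs if t == 1)            # true malignant cases
    n_b = sum(1 for t, _ in pairs if t == 0)            # true benign cases
    n_m_correct = sum(1 for t, p in pairs if t == 1 and p == 1)
    n_b_correct = sum(1 for t, p in pairs if t == 0 and p == 0)
    j_class = 100.0 * (n_m_correct + n_b_correct) / len(pairs)
    j_sens = 100.0 * n_m_correct / n_m
    j_spec = 100.0 * n_b_correct / n_b
    return j_class, j_sens, j_spec

# Hypothetical test set: 4 malignant and 6 benign cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1, 1, 1]
print(classification_measures(y_true, y_pred))
```

The toy labels are chosen so that a classifier biased towards the malignant class shows high sensitivity together with low specificity.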
The results of the Bayesian quadratic discriminant based classifier

First, we have considered the quadratic Gaussian discriminant based classifier.
We have assumed a multivariate normal (Gaussian) distribution of the feature
vector x within each class. The vector form of the Gaussian
probability density function $p(\mathbf{x}|c_i)$, for the feature vector x within a class $c_i$,
is given by the equation:

$$
p(\mathbf{x}|c_i) = \frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}}
\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{T} \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)\right] \tag{7}
$$

where $\boldsymbol{\mu}_i$ is the mean vector of the ith class feature vector, $\Sigma_i$ is the ith class
feature vector covariance matrix, $|\Sigma_i|$ is the determinant of the covariance
matrix, and n is the dimension of the feature vector x. Based on Bayes' optimal
decision rule, the following quadratic discriminant can be derived:

$$
d_i(\mathbf{x}) = \ln\left\{\frac{1}{(2\pi)^{n/2} |\Sigma_i|^{1/2}}
\exp\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{T} \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)\right]\right\}
+ \ln P(c_i), \quad i = 1, 2, \ldots, l \tag{8}
$$



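Additional derivations give the following discriminant function:

$$
d_i(\mathbf{x}) = -\frac{1}{2}\ln|\Sigma_i| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^{T} \Sigma_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) + \ln P(c_i), \quad i = 1, 2, \ldots, l \tag{9}
$$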
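
The discriminant function (9) can be evaluated with a few lines of code. The sketch below uses only the Python standard library, with a small Gaussian elimination routine that supplies both the solution of the linear system and ln|Σ_i|. The class parameters are hypothetical toy values, not estimates from the EP data, and the plain maximum-discriminant decision shown here is the basic Bayes rule only.

```python
import math

def solve_and_logdet(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting.

    Returns (z, ln|det(matrix)|); the matrix is assumed to be non-singular.
    """
    n = len(matrix)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]   # augmented copy
    logdet = 0.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        logdet += math.log(abs(a[col][col]))
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                      # back substitution
        tail = sum(a[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (a[r][n] - tail) / a[r][r]
    return z, logdet

def discriminant(x, mean, cov, prior):
    """d_i(x) = -1/2 ln|S_i| - 1/2 (x-mu_i)^T S_i^{-1} (x-mu_i) + ln P(c_i)."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    z, logdet = solve_and_logdet(cov, diff)             # z = S_i^{-1} (x - mu_i)
    mahalanobis = sum(d * zi for d, zi in zip(diff, z))
    return -0.5 * logdet - 0.5 * mahalanobis + math.log(prior)

def classify(x, classes):
    """Pick the class label with the largest discriminant value."""
    return max(classes, key=lambda label: discriminant(x, *classes[label]))

# Hypothetical two-class problem: label -> (mean, covariance, prior).
classes = {
    0: ([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], 0.5),
    1: ([3.0, 3.0], [[2.0, 0.5], [0.5, 1.0]], 0.5),
}
print(classify([0.2, -0.1], classes), classify([3.1, 2.8], classes))
```

Solving the linear system instead of inverting the covariance matrix is a design choice for numerical convenience; both give the same quadratic form.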

The quadratic Gaussian discriminant classifier was designed and experiments
were performed for the training and test sets containing patterns in the
differentials feature space. In the classifier design we decided to consider an
unknown category. The unknown category was assigned when the absolute
difference between the discriminants for the benign and the malignant class was less than a
given threshold th_u (from the range [0,1]). Malignant cases were favored by the
factor w_m. We have found a proper value for w_m to be 0.9. The following data sets
were analyzed:



1. The training set T_tra, for designing the discriminant parameters, contained
615 cases with patterns consisting of the age code plus 27 differentials:
469 cases of class 0 (benign), 146 cases of class 1 (malignant).

2. The test set T_test, for the classifier performance evaluation, contained 1301
cases, with patterns containing the age code plus 27 differentials: 968 class
0 (benign), 333 class 1 (malignant).

The results of three experiments are shown in Table 1. Experiment 1: malignant
class favored by the weight w_m = 0.9, with threshold th_u = 1.0. Experiment
2: malignant class favored by the weight w_m = 0.9, with threshold th_u = 0.8.
Experiment 3: malignant class favored by the weight w_m = 0.9, with threshold th_u = 0.5.

Table 1. Results of differential classification by the Gaussian quadratic discriminant
based classifier.

Experiment   All misclassified %   Sensitivity %   Specificity %   Unknown %
1            31.83                 93.70           26.45           14.30
2            32.21                 93.12           24.38            9.76
3            33.39                 92.53           21.38            6.81



Results of Gaussian quadratic discriminant classifier with 28-dimensional
differential pattern transformed into PCA space

The Gaussian quadratic discriminant classifier was also designed for the patterns
transformed by the KLT and reduced. The 28-dimensional patterns y, in the
differential feature space, were transformed into the principal component space using
the Karhunen-Loeve transformation (KLT) derived from the training set. First,
the full-size 28-element PCA feature patterns were obtained as a result of
the KLT. Then the PCA feature pattern was reduced to dimension
m < n. We present results with the 28-dimensional differential
pattern transformed into m = 12 dimensional patterns in the PCA space. The
reduced feature vector y_PCA contained the first 12 principal components (from
the total number of n = 28 principal components of the transformed pattern).

The following data sets were used: 

1. The training set T_tra,PCA, for designing the discriminant parameters,
contained 615 cases with pattern elements being the first 12 principal components
of the transformed original training set (containing differentials): 469 class 0 (benign),
146 class 1 (malignant).

2. The test set T_test,PCA, used for the performance evaluation, contained 1301
cases, with pattern elements being the first 12 principal components of the transformed
original test set (containing differentials): 968 class 0 (benign), 333 class 1
(malignant).

The results of two experiments are shown in Table 2. Experiment 1: both classes
equally treated. Experiment 2: malignant class favored.

Table 2. Results of quadratic Gaussian classifier for PCA features.

Experiment   All misclassified %   Sensitivity %   Specificity %
1            34.43                 81.93           18.02
2            35.71                 94.62           15.74



Results of classification with the KLT transformation of differential
patterns with the rough sets based reduction

The rough sets method was applied to the training set T_tra,PCA containing 615
cases. Each case contained the full-size (n = 28) pattern transformed by the KLT
(from differentials) and a corresponding class. The rough sets method was applied
to the discretized training data set, and the reduct set and the core were
computed. The results of the Gaussian quadratic discriminant classifier for a selected
7-element reduct (with the corresponding reduced PCA feature patterns) are
presented in Table 3.

We present results of two experiments: experiment 1, both classes equally
treated; experiment 2, malignant class favored.



Table 3. Results of quadratic Gaussian classifier for PCA features reduced by
rough sets.

Experiment   All misclassified %   Sensitivity %   Specificity %
1            38.89                 82.89           17.11
2            36.65                 92.45           17.55



3 Conclusion 

The classification experiments show the possibility of breast cancer detection
using electropotentials. However, the initial results show that this type of data is
difficult to classify. The application of principal component analysis improved
classification accuracy for breast cancer detection using the quadratic Gaussian
discriminant classifier. Additionally, the rough sets method made it possible to reduce
the number of features without a substantial decrease in classification accuracy.
Abstract. The paper presents handwritten character recognition by
the data mining and knowledge discovery software system
RoughNeuralLab. In the recognition experiments the Zernike moments were applied
as the extracted features of character images. For further feature
reduction the rough set theory method was applied as a front end of a neural
network. Finally, error backpropagation neural network classifiers
were designed for the reduced feature subsets.



1 Introduction 

The contemporary concern about soft computing and knowledge discovery has
put forward applications of neural processing and useful extensions of
elementary set theory such as rough sets (Pawlak, 1991; Skowron, 1990). Basically,
rough sets embody the idea of indiscernibility between objects in a set.
Neural networks, trained for example by error backpropagation, provide soft
computation with generalization abilities for processing unseen instances. We combined
both ideas and implemented them as a rough-neural software system,
RoughNeuralLab (Swiniarski, 1995), for data mining and pattern recognition. The system
is especially suitable for processing images and data sets in the form of decision
tables. In this software system, rough sets are utilized to reduce the size of a
knowledge base without losing valuable information in the process of
reduction, while neural networks are used for classification. By applying the rough set
theory, unimportant knowledge in a decision table can be eliminated and useful
information (reducts) can be obtained. The reduced decision table, with reduced
features of a pattern, can be used for a classifier design.

The paper is organized as follows. In Section 2 we introduce Zernike moments
as a robust method of feature extraction in terms of rotation invariance and
the ability to reconstruct the image via the inverse transform. Then we outline the
ideas and notation used in the rough set theory for knowledge reduction.
Finally, we discuss the software system RoughNeuralLab (Swiniarski, 1995) and
its usage for data mining and classification. The paper concludes with a
discussion of experiments with the described system for handwritten character
recognition.



L. Polkowski and A. Skowron (Eds.): RSCTC’98, LNAI 1424, pp. 617—624, 1998. 
(c) Springer- Verlag Berlin Heidelberg 1998 
2 Feature extraction by complex Zernike moments

2.1 Introduction 

A robust recognition system must be able to recognize an image regardless of its
orientation, size and position. In other words, rotation, scale and translation
invariance of the extracted features is desirable. Historically, Hu
(1962) first introduced the use of image moment invariants for two-dimensional
pattern recognition applications in 1961. He derived a set of invariant moments
which has the desirable properties of being invariant under image rotation,
scaling, and translation. However, the question of what is gained by including higher
order moments in the analysis of an image has not been addressed.

Teague (1980) has suggested the notion of orthogonal moments to recover the
image from moments based on the theory of orthogonal polynomials, and has
introduced Zernike moments. This approach is simple and allows moment
invariants of arbitrarily high order to be constructed in a straightforward
manner. In addition, Zernike moments possess a useful rotation invariance
property: rotating the image does not change the magnitudes of the moments.

Since the features defined by means of Zernike moments are only rotation
invariant, an image must be normalized to obtain scale and translation
invariance. Before giving the idea of the image normalization
process we briefly describe translation, scaling and rotation.

A gray-scale spatial domain image can be defined as

$$
\{ f(x,y) \in \{0, 1, \ldots, 255\} : x = 0, 1, \ldots, M-1;\ y = 0, 1, \ldots, N-1 \}, \tag{1}
$$

where x is a column index, y a row index, M the number of columns, N the number
of rows, and f(x,y) the pixel value at location (x,y).

The proposed Zernike moments are only rotationally invariant, but the
considered images have scale and translation differences as well. Therefore, prior to
extraction of Zernike moments, these images should be normalized with respect to
scaling and translation.

Translation invariance can be achieved by moving the origin to the center of
an image. In order to get the centroid location of an image, general moments
(or regular moments) have been utilized. General moments are defined as

$$
m_{pq} = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^{p} y^{q} f(x,y)\,dx\,dy \tag{2}
$$

where $m_{pq}$ is the (p + q)th order moment of the continuous image function
f(x,y). Since we are dealing with digital images, the integrals can be replaced
by summations. Given a two-dimensional M x N image, $m_{pq}$ becomes

$$
m_{pq} = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} x^{p} y^{q} f(x,y) \tag{3}
$$

To keep the dynamic range of $m_{pq}$ consistent for images of different sizes,
the M x N image plane should first be mapped onto a square defined by x ∈ [-1,
+1], y ∈ [-1, +1]. This implies that grid locations will no longer be integers but
will have real values in the [-1, +1] range. This changes the definition of $m_{pq}$ to

$$
m_{pq} = \sum_{x=-1}^{+1} \sum_{y=-1}^{+1} x^{p} y^{q} f(x,y) \tag{4}
$$


Now we can find the centroid location of an image using the general moments.
According to Zernike (1934), the coordinates of the image centroid $(\bar{x}, \bar{y})$ are

$$
\bar{x} = \frac{m_{10}}{m_{00}}, \qquad \bar{y} = \frac{m_{01}}{m_{00}} \tag{5}
$$



To achieve translation invariance, Hu (1980) suggested transforming
the image into a new one whose first order moments, $m_{01}$ and $m_{10}$, are
both equal to zero. This is done by transforming the original image into the
$f(x + \bar{x}, y + \bar{y})$ image, where $\bar{x}$ and $\bar{y}$ are the centroid coordinates of the original
image. Let g(x,y) be the translated image; the new image function becomes

$$
g(x,y) = f(x + \bar{x}, y + \bar{y}) \tag{6}
$$



Scale invariance is accomplished by enlarging or reducing each image such
that its zero order moment, $m_{00}$, is set equal to a predetermined value $\beta$. We
achieve this by transforming the original image function f(x,y) into a new
function f(x/a, y/a), with the scaling factor a, where

$$
a = \sqrt{\frac{\beta}{m_{00}}} \tag{7}
$$

Let g(x,y) be the scaled image. After scale normalization, we get

$$
g(x,y) = f\left(\frac{x}{a}, \frac{y}{a}\right) \tag{8}
$$

Hence,

$$
g(x,y) = f\left(\frac{x}{a} + \bar{x}, \frac{y}{a} + \bar{y}\right) \tag{9}
$$

with $(\bar{x}, \bar{y})$ the centroid of f(x,y), $a = \sqrt{\beta / m_{00}}$, and $\beta$ a predetermined value,
normalizes a function with respect to scale and translation.
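
The normalization parameters follow directly from the general moments. The Python sketch below computes the centroid and the scale factor for a small image mapped onto the [-1, +1] square. Placing the samples at pixel centres is an assumption (the text does not fix the exact grid), and the function names and the toy image are hypothetical.

```python
import math

def grid(size):
    """Pixel-centre coordinates of `size` samples mapped onto [-1, +1]."""
    return [-1.0 + (2 * k + 1) / size for k in range(size)]

def moment(image, p, q):
    """General moment m_pq of an image given as rows of pixel values."""
    ys, xs = grid(len(image)), grid(len(image[0]))
    return sum(x ** p * y ** q * image[r][c]
               for r, y in enumerate(ys) for c, x in enumerate(xs))

def normalization_parameters(image, beta):
    """Centroid (x_bar, y_bar) and scale factor a = sqrt(beta / m00)."""
    m00 = moment(image, 0, 0)
    x_bar = moment(image, 1, 0) / m00
    y_bar = moment(image, 0, 1) / m00
    return x_bar, y_bar, math.sqrt(beta / m00)

# Hypothetical 4 x 4 binary image with a 2 x 2 blob in one corner.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(normalization_parameters(image, beta=16.0))
```

The normalized image would then be obtained by resampling f at (x/a + x_bar, y/a + y_bar); the interpolation scheme for that resampling is left open here.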



2.2 Zernike moments 

Zernike (1934) introduced a set of complex polynomials which form a complete
orthogonal set over the interior of the unit circle, i.e., $x^2 + y^2 = 1$. Let the set of
these polynomials be denoted by $\{V_{nl}(x,y)\}$. The form of these polynomials is:

$$
V_{nl}(x,y) = V_{nl}(\rho \sin\theta, \rho \cos\theta) = V_{nl}(\rho,\theta) = R_{nl}(\rho)\exp(il\theta) \tag{10}
$$



where

n   Positive integer or zero.

l   Positive or negative integer, subject to the constraints n - |l| even,
    |l| ≤ n.

i   The complex number $i = \sqrt{-1}$.

ρ   Length of the vector from the origin to the (x,y) pixel.

θ   Angle between the vector ρ and the x-axis in the counterclockwise direction.

Radial polynomials $R_{nl}(\rho)$ are defined as

$$
R_{nl}(\rho) = \sum_{s=0}^{(n-|l|)/2} (-1)^{s}
\frac{(n-s)!}{s!\left(\frac{n+|l|}{2}-s\right)!\left(\frac{n-|l|}{2}-s\right)!}\,\rho^{n-2s} \tag{11}
$$

These polynomials are orthogonal.
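
As a quick check of the radial polynomial definition, the following Python sketch evaluates R_nl(ρ) term by term from the factorial formula. The function name is hypothetical; the low-order cases can be compared with the closed forms R_20(ρ) = 2ρ² - 1 and R_31(ρ) = 3ρ³ - 2ρ that follow from the same formula.

```python
from math import factorial

def radial_polynomial(n, l, rho):
    """Zernike radial polynomial R_nl(rho) for |l| <= n and n - |l| even."""
    l = abs(l)
    if l > n or (n - l) % 2:
        raise ValueError("require |l| <= n and n - |l| even")
    total = 0.0
    for s in range((n - l) // 2 + 1):
        coeff = ((-1) ** s * factorial(n - s)
                 / (factorial(s)
                    * factorial((n + l) // 2 - s)
                    * factorial((n - l) // 2 - s)))
        total += coeff * rho ** (n - 2 * s)
    return total

print(radial_polynomial(2, 0, 0.5), radial_polynomial(3, 1, 0.5))
```

For high orders the factorials grow quickly, so recurrence-based evaluation is often preferred in practice; the direct formula is kept here because it mirrors the definition.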

Zernike moments are the projection of the image function onto these orthog- 
onal basis functions. The Zernike moments of order n with repetition I for a 
continuous image function /(x,y) is 

/ f f{x,y)[Vnl{p,0)]*dxdy={An-l)* (12) 

TV J Jx‘^+y‘^<l 

For a digital image: 

Anl=^^^-^^^I{x,y)V*i{p,d)] (13) 

X y 



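or

$$\begin{bmatrix} C_{nl} \\ S_{nl} \end{bmatrix} = \frac{2n+2}{\pi}\iint_{x^2+y^2\le 1} f(x,y)\,R_{nl}(\rho)\begin{bmatrix} \cos l\theta \\ -\sin l\theta \end{bmatrix} dx\,dy \tag{14}$$

$$C_{nl} = 2\,\mathrm{Re}(A_{nl}) = \frac{2n+2}{\pi}\iint_{x^2+y^2\le 1} f(x,y)\,R_{nl}(\rho)\cos l\theta\,dx\,dy \tag{15}$$

$$S_{nl} = 2\,\mathrm{Im}(A_{nl}) = \frac{-2n-2}{\pi}\iint_{x^2+y^2\le 1} f(x,y)\,R_{nl}(\rho)\sin l\theta\,dx\,dy \tag{16}$$

and for l = 0

$$C_{n0} = A_{n0} = \frac{n+1}{\pi}\iint_{x^2+y^2\le 1} f(x,y)\,R_{n0}(\rho)\,dx\,dy, \qquad S_{n0} = 0 \tag{17}$$

For a digital image, when $l \ne 0$

$$\begin{bmatrix} C_{nl} \\ S_{nl} \end{bmatrix} = \frac{2n+2}{\pi}\sum_{x=-1}^{+1}\sum_{y=-1}^{+1} f(x,y)\,R_{nl}(\rho)\begin{bmatrix} \cos l\theta \\ -\sin l\theta \end{bmatrix} \tag{18}$$

and when l = 0

$$C_{n0} = A_{n0} = \frac{n+1}{\pi}\sum_{x=-1}^{+1}\sum_{y=-1}^{+1} f(x,y)\,R_{n0}(\rho), \qquad S_{n0} = 0 \tag{19}$$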
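
The discrete sums for $C_{nl}$ and $S_{nl}$ can be sketched in Python as shown
below. This is a hedged illustration rather than the code used in the
experiments. The function names `radial_poly` and `zernike_cs` are
hypothetical. The sketch assumes a square image whose pixel centres are mapped
into the square $[-1, 1] \times [-1, 1]$, uses only pixels inside the unit
disk, measures $\theta$ from the x-axis, and multiplies the sum by the pixel
area $(2/N)^2$ so that it approximates the continuous integral; that area
factor is an assumption of this example.

```python
import numpy as np
from math import factorial

def radial_poly(n, l, rho):
    """Zernike radial polynomial R_nl(rho) (finite-sum definition)."""
    l = abs(l)
    return sum((-1) ** s * factorial(n - s)
               / (factorial(s) * factorial((n + l) // 2 - s)
                  * factorial((n - l) // 2 - s))
               * rho ** (n - 2 * s)
               for s in range((n - l) // 2 + 1))

def zernike_cs(img, n, l):
    """Return (C_nl, S_nl) for a square image mapped onto the unit disk."""
    img = np.asarray(img, dtype=float)
    size = img.shape[0]
    coords = (2 * np.arange(size) + 1 - size) / size   # pixel centres in (-1, 1)
    x, y = np.meshgrid(coords, coords)
    rho = np.hypot(x, y)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0                                 # unit-disk mask
    area = (2.0 / size) ** 2                            # assumed pixel area
    k = (n + 1) / np.pi if l == 0 else (2 * n + 2) / np.pi
    w = k * area * img[inside] * radial_poly(n, l, rho[inside])
    c = float(np.sum(w * np.cos(l * theta[inside])))
    # S carries the minus sign of the sine term; it is zero for l = 0.
    s = 0.0 if l == 0 else float(-np.sum(w * np.sin(l * theta[inside])))
    return c, s
```

The magnitude $\sqrt{C_{nl}^2 + S_{nl}^2}$ is the quantity that is unchanged by
rotation of the image, which is why it is a natural candidate for a feature;
rotating an array by 90 degrees with `np.rot90` gives a quick numerical check.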



3 Rough sets in knowledge reduction 



Rough set theory has been developed for knowledge discovery in experimental
data sets. Rough set techniques reduce the computational complexity of
learning processes and eliminate irrelevant attributes, making knowledge
discovery more efficient. Rough sets were introduced by Zdzislaw Pawlak to
provide a systematic framework for studying imprecise and incomplete knowledge.
Rough set theory is based on the concept of an upper and a lower approximation
of a set, the approximation space, and probabilistic and deterministic models
of sets. Rough set theory deals with information represented by a table called
an information system. This table consists of objects (or cases) and attributes.
The entries in the table are the categorical values of the features and possibly
categories. We use the standard notation (see Pawlak (1991)); in particular,
we denote by $IND(S)$ the indiscernibility relation of the information system S;
by $[x]_A$ the equivalence class of x defined by $IND(A)$; by $\underline{A}X$
and $\overline{A}X$ the lower and upper approximations of X with respect to A;
and by $X_i$ the i-th decision class of the decision table DT.

The accuracy of an approximation of set X by the set of attributes A is
defined as follows:

$$\alpha_A(X) = \frac{\mathrm{card}\,\underline{A}X}{\mathrm{card}\,\overline{A}X} \tag{20}$$



The rough (A-rough) membership function of the set X is defined as:

$$\mu_X^A(x) = \frac{\mathrm{card}([x]_A \cap X)}{\mathrm{card}([x]_A)} \tag{21}$$
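
The notions of indiscernibility class, lower and upper approximation, accuracy
of approximation, and rough membership can be made concrete with a small Python
sketch. The function names and the toy information system below are
hypothetical and serve only as an illustration; the set X is assumed to be
non-empty so that the upper approximation is non-empty.

```python
from collections import defaultdict

def ind_classes(table, attrs):
    """Equivalence classes of IND(attrs); table maps object -> {attribute: value}."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())

def approximations(table, attrs, X):
    """Return the (lower, upper) approximations of X with respect to attrs."""
    lower, upper = set(), set()
    for cls in ind_classes(table, attrs):
        if cls <= X:          # class entirely inside X
            lower |= cls
        if cls & X:           # class overlaps X
            upper |= cls
    return lower, upper

def accuracy(table, attrs, X):
    """card(lower) / card(upper); X is assumed to be non-empty."""
    lower, upper = approximations(table, attrs, X)
    return len(lower) / len(upper)

def rough_membership(table, attrs, X, x):
    """card([x] intersect X) / card([x]) for object x."""
    cls = next(c for c in ind_classes(table, attrs) if x in c)
    return len(cls & X) / len(cls)

# Hypothetical toy information system with attributes a and b.
table = {
    1: {"a": 0, "b": 0}, 2: {"a": 0, "b": 0}, 3: {"a": 1, "b": 0},
    4: {"a": 1, "b": 1}, 5: {"a": 1, "b": 1}, 6: {"a": 0, "b": 1},
}
X = {1, 2, 4}
lower, upper = approximations(table, ["a", "b"], X)   # {1, 2} and {1, 2, 4, 5}
```

In this toy example objects 4 and 5 are indiscernible, so object 4 belongs to
the upper but not to the lower approximation of X, and the accuracy of
approximation is 2/4 = 0.5.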



The accuracy of approximation of the classification
$\Gamma = \{X_1, X_2, \ldots, X_n\}$ by the set of attributes A is defined as
follows:

$$\alpha_A(\Gamma) = \frac{\sum_{i=1}^{n} \mathrm{card}(\underline{A}X_i)}{\sum_{i=1}^{n} \mathrm{card}(\overline{A}X_i)} \tag{22}$$

The process of finding a smaller set of attributes than the original one with
the same classificatory power as the original set is called attribute
reduction. A reduct is the essential part of an information system, i.e. a
minimal attribute subset discerning all objects discernible by the original
information system. A core is the common part of all reducts. The set of all
reducts in S with the attribute set A is denoted by $RED(A)$.
Given a decision table S with condition and decision attributes $Q = C \cup D$,
for a given set of condition attributes $A \subseteq C$ we can define a positive
region $POS_A(D)$ in the relation $IND(D)$, as

$$POS_A(D) = \bigcup \{\underline{A}X \mid X \in IND(D)\} \tag{23}$$

The positive region $POS_A(D)$ contains all objects in U which can be classified
without error (ideally) into distinct classes defined by $IND(D)$ based only on
information in the relation $IND(A)$.

The cardinality (size) of the A-positive region of B is used to define a measure
(a degree) $\gamma_A(B)$ of dependency of the set of attributes B on A:

$$\gamma_A(B) = \frac{\mathrm{card}(POS_A(B))}{\mathrm{card}(U)}$$

We say that the set of attributes B depends on the set of attributes A in a
degree $\gamma_A(B)$.

One can find a reduct of condition attributes relative to the decision attributes
by removing superfluous condition attributes, without losing the classification
power of the reduced decision table. The core of a decision table is the
intersection of all its relative reducts. It consists of all indispensable
condition attributes.
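
The positive region, the degree of dependency, relative reducts, and the core
can be illustrated with the following Python sketch. It is a brute-force
illustration over all attribute subsets, practical only for a handful of
attributes, and is not the reduct algorithm used in the experiments. All
function names and the toy decision table are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def partition(table, attrs):
    """Equivalence classes of IND(attrs); table maps object -> {attribute: value}."""
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attrs)].add(obj)
    return list(classes.values())

def positive_region(table, cond, dec):
    """Union of the cond-classes that lie entirely inside one decision class."""
    dec_classes = partition(table, dec)
    pos = set()
    for cls in partition(table, cond):
        if any(cls <= d for d in dec_classes):
            pos |= cls
    return pos

def dependency(table, cond, dec):
    """Degree of dependency: card(POS) / card(U)."""
    return len(positive_region(table, cond, dec)) / len(table)

def relative_reducts(table, cond, dec):
    """All minimal subsets of cond that keep the full degree of dependency."""
    full = dependency(table, cond, dec)
    found = []
    for k in range(1, len(cond) + 1):          # smaller subsets first
        for subset in combinations(cond, k):
            if any(set(r) <= set(subset) for r in found):
                continue                       # superset of a reduct: not minimal
            if dependency(table, subset, dec) == full:
                found.append(subset)
    return found

# Hypothetical toy decision table: c duplicates a, and d = a OR b.
table = {
    1: {"a": 0, "b": 0, "c": 0, "d": 0},
    2: {"a": 0, "b": 1, "c": 0, "d": 1},
    3: {"a": 1, "b": 0, "c": 1, "d": 1},
    4: {"a": 1, "b": 1, "c": 1, "d": 1},
}
reducts = relative_reducts(table, ["a", "b", "c"], ["d"])
core = set.intersection(*(set(r) for r in reducts))
```

In the toy table either a or c can be dropped but b cannot, so the relative
reducts are {a, b} and {b, c}, and the core is {b}.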

3.1 Classification results

In the numerical classification experiments, we used a character image database
consisting of digits selected from the handwritten ZIP codes of 49 different
writers, provided by the National Institute of Standards and Technology
(formerly the National Bureau of Standards). This database is a subset of
“NIST Special Database 1”. These characters have been isolated and specially
normalized to 32 x 32 pixel images. From this image database, 500 images were
selected as our training set and 200 images (different from the 500 training
images) were randomly selected as our test set.

In the numerical experiments, the image data were divided into two sets, the
training image data set and the testing image data set. The images of the
training data set are the prototypes on which the system is trained, while the
images of the testing data set are randomly selected from the images of the
database.

When generating the training and test sets from the image data sets, several
selection options were available, including image thinning, moment extraction,
and reduct computation.

We tested the recognition system with two testing image data sets. The original
testing image data set is randomly selected from the image database. The second
testing image data set contains images contaminated by injecting noise into the
original images.

We used 500 images for training because of the limited system capacity. The
training image data set was transformed into the decision table set in several
ways, described in Table 1. One of the input data sets was not thinned, another
was thinned before the general moments were applied, and the third was thinned
after the general moments had been extracted. Each of these input data sets was
the source of Zernike moments, and the system generates a decision table from
the Zernike moments using two selections, the domain and the range. Based on our
experience, we selected 5 as the domain and 1.0 as the range. Among the
attributes of the decision table, several attributes are selected as the final
input data set according to one of the reduct sets. We selected two reduct sets
to obtain two final input data sets. One of the selected reduct sets consists of
the attributes that appear in more than 40% of all reduct sets, while the other
consists of the smallest number of attributes among all reduct sets. The error
backpropagation neural networks were trained with one hidden layer or two hidden
layers. Therefore, the system is trained with 12 different combinations of input
data sets and neural network selections, listed in Table 1.
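
The 12 configurations are the Cartesian product of three thinning options, two
reduct sets, and two network depths. A short Python sketch of this enumeration
follows. The labels "first" and "second" for the two reduct sets are
placeholders, because the text does not say which selected reduct set carries
which label, and the numbering order (thinning varying fastest, hidden layers
slowest) is an assumption chosen to match the description of systems 7 and 9.

```python
from itertools import product

thinning = ["no", "before", "after"]   # relative to general-moment extraction
reduct_sets = ["first", "second"]      # placeholder labels (assumption)
hidden_layers = [1, 2]

# Number the systems so that thinning varies fastest and hidden layers slowest.
systems = {
    i: {"thinning": t, "reduct_set": r, "hidden_layers": h}
    for i, (h, r, t) in enumerate(
        product(hidden_layers, reduct_sets, thinning), start=1)
}
```

With this ordering, `systems[9]` is the configuration thinned after moment
extraction with two hidden layers, and `systems[7]` is the unthinned
configuration with two hidden layers.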



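Table 1. The user’s selection options for the training sets

| Option     | Thinning                            | Reduct Sets   | Hidden layers   |
|------------|-------------------------------------|---------------|-----------------|
| Step       | Preprocessing                       | Preprocessing | Classifier      |
| Substep    | Thinning                            | Rough Sets    | Neural Networks |
| Selections | no, before or after General Moments | [illegible]   | 1, 2            |
| System 1   | no                                  | 1st           | 1               |
| System 2   | before                              | 1st           | 1               |
| System 3   | after                               | 1st           | 1               |
| System 4   | no                                  | [illegible]   | 1               |
| System 5   | before                              | [illegible]   | 1               |
| System 6   | after                               | [illegible]   | 1               |
| System 7   | no                                  | 1st           | 2               |
| System 8   | before                              | 1st           | 2               |
| System 9   | after                               | 1st           | 2               |
| System 10  | no                                  | [illegible]   | 2               |
| System 11  | before                              | [illegible]   | 2               |
| System 12  | after                               | [illegible]   | 2               |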



The testing image data set, with 200 images, was used in two versions: one was
the original testing image data set and the other contained 3% of injected
global noise (noise injected on arbitrary pixels of the original image).
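
One plausible reading of "3% global noise" is that 3% of the pixels, chosen
uniformly over the whole image, are inverted. The Python sketch below
implements that reading. It is an assumption: the text does not specify the
noise model, and the function name `inject_global_noise`, the binary (0/1)
image format, and the pixel-flipping rule are illustrative choices.

```python
import numpy as np

def inject_global_noise(img, fraction=0.03, seed=None):
    """Flip a given fraction of the pixels of a binary (0/1) image.

    Assumption: noise means inverting pixels chosen uniformly at random,
    without replacement, anywhere in the image.
    """
    rng = np.random.default_rng(seed)
    noisy = np.array(img, dtype=np.uint8)           # copy; original untouched
    n_flip = int(round(fraction * noisy.size))
    idx = rng.choice(noisy.size, size=n_flip, replace=False)
    noisy.flat[idx] = noisy.flat[idx] ^ 1           # invert the selected pixels
    return noisy
```

For a 32 x 32 image this flips 31 of the 1024 pixels.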

For systems 1, 3, 4, 6, 7, 9, 10 and 12, the value 12 was selected as the
experimental order a and the value 320 as the experimental pixel number
$\beta$. For systems 2, 5, 8 and 11, 12 was selected as the experimental order
a and 65 as the experimental pixel number $\beta$. For test set 1, the best
classification result of 84.0% was obtained for system 9, with images thinned
after the general moments were extracted and applied for transformation and a
neural network with 2 hidden layers. The best classification result for test
set 2, with noise injected into the images, was 79.5%; it was obtained for
system 7, with the training set of unthinned images and a neural network with
2 hidden layers.

Conclusions. The objective of the research was to build an image classification
system using neural networks, based on invariant Zernike moments for feature
extraction and rough set tools for feature reduction. In the best experimental
setting we obtained 84% classification accuracy for the handwritten digit
recognition task. Through the application of rough set theory, we could
significantly reduce the amount of information about the images and reduce the
task running time to one-fourth of that of the regular system.